LLM and hardware

Overview

  • Computations in LLMs
  • LLMs on supercomputers

Computations in LLMs

Neural networks

  • Learn patterns by adjusting parameters (weights);
  • Training = prediction → differentiation → update;
  • So far: mini-batch training, better optimizers, and bigger models → better results.
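
The prediction → differentiation → update loop can be sketched in a few lines. This is a toy one-parameter model with purely illustrative numbers, not a real training setup:

```python
# Toy one-parameter model y = w * x with squared-error loss.
# One step = prediction -> differentiation -> update.

def train_step(w, x, target, lr=0.1):
    pred = w * x                      # prediction (forward pass)
    grad = 2 * x * (pred - target)    # gradient of (pred - target)^2 w.r.t. w
    return w - lr * grad              # parameter update (gradient descent)

w = 0.0
for _ in range(50):                   # iterate over (mini-batches of) data
    w = train_step(w, x=1.0, target=3.0)

print(round(w, 3))                    # w converges toward the target value 3.0
```

Real training does the same three steps, just over billions of parameters and with the mini-batch work spread across many devices.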

Transformer

  • Transformer computes relationships between tokens (attention);
  • Tokens can be processed in parallel.
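
A minimal pure-Python sketch of scaled dot-product attention; scalar loops stand in for the matrix operations a real implementation uses:

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention. Each query row is computed
    independently, which is why tokens can be processed in parallel."""
    d = len(Q[0])
    out = []
    for q in Q:                                   # one query per token
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in K]                     # similarity to every token
        m = max(scores)                           # numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

res = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Because no query depends on another query's result, the loop over `Q` maps directly onto parallel hardware.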

Training of LLMs

  • Just neural networks, but ones that can be parallelized more efficiently;

Fine-tuning of LLMs

  • With specialized data (instruct, chat, etc);
  • Less memory usage by "freezing parameters"
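
A sketch of why freezing helps (layer names and sizes below are hypothetical): frozen parameters need no gradients or optimizer state, so only a small fraction of the model costs training memory:

```python
# Hypothetical layer sizes; "trainable" marks what fine-tuning updates.
# Frozen layers need no gradients or optimizer state.

layers = {
    "embedding":   {"params": 1_000_000, "trainable": False},  # frozen
    "transformer": {"params": 8_000_000, "trainable": False},  # frozen
    "lm_head":     {"params":   500_000, "trainable": True},   # fine-tuned
}

trainable = sum(l["params"] for l in layers.values() if l["trainable"])
total = sum(l["params"] for l in layers.values())
print(f"training {trainable / total:.1%} of {total:,} parameters")
```

Techniques like LoRA push this further by training small added matrices instead of any original layer.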

Inference of LLMs

  • GPT-style inference: pre-filling and decoding;
  • Pre-filling: process the input prompt in parallel;
  • Decoding: generate new tokens one-by-one, using cached results.
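
The two phases can be sketched with a toy cache; `compute_kv` is a numeric placeholder for the per-token keys/values a real transformer layer would produce:

```python
def compute_kv(token):
    return (token * 2, token * 3)     # placeholder per-token K, V

def prefill(prompt):
    # pre-filling: every prompt token is processed in one parallel pass
    return [compute_kv(t) for t in prompt]

def decode_step(cache, token):
    # decoding: only the NEW token's K/V are computed; the rest are cached
    cache.append(compute_kv(token))
    # the next token depends on the whole cache (a sum stands in for attention)
    return sum(v for _, v in cache) % 7

cache = prefill([1, 2, 3])
token, generated = 4, []
for _ in range(3):                    # tokens come out one at a time
    token = decode_step(cache, token)
    generated.append(token)
```

Without the cache, every decode step would recompute K/V for the entire history; with it, each step does work proportional to one token.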

Optimize caches for inference

  • KV cache: store keys/values of previous tokens to avoid recomputation;
  • paged attention: KV cache kept in indexed blocks;
  • flash attention: fuses attention operations to cut memory traffic;

For a more in-depth discussion of the technique behind that visualization, see: paged attention from first principles.
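
The core bookkeeping of paged caching can be sketched as follows (block size and pool size are arbitrary here; real systems like vLLM manage this per layer and per sequence):

```python
BLOCK_SIZE = 4
pool = {}                       # block index -> list of cached (K, V) entries
free_blocks = list(range(8))    # shared pool of fixed-size blocks
block_table = []                # this sequence's block indices, in order

def append_kv(entry):
    # allocate a new block only when the last one is full, so no large
    # contiguous memory reservation per sequence is ever needed
    if not block_table or len(pool[block_table[-1]]) == BLOCK_SIZE:
        blk = free_blocks.pop(0)
        pool[blk] = []
        block_table.append(blk)
    pool[block_table[-1]].append(entry)

for t in range(10):             # cache K/V for 10 tokens
    append_kv((f"k{t}", f"v{t}"))
```

Attention then looks tokens up through the block table, so sequences of very different lengths can share one memory pool without fragmentation.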

Key takeaway

  • LLMs/NNs benefit from massive parallelization;
  • Needs differ by task:
  • training: memory + compute + data throughput;
  • fine-tuning: similar to training, but cheaper;
  • pre-filling: compute-bound;
  • decoding: memory-bound;
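
A back-of-envelope calculation shows why decoding is memory-bound: each generated token must stream roughly all model weights from GPU memory. Model size and bandwidth below are illustrative, approximately A100-class figures:

```python
params = 7e9            # hypothetical 7B-parameter model
bytes_per_param = 2     # FP16
bandwidth = 2e12        # ~2 TB/s HBM bandwidth (roughly A100-class)

bytes_per_token = params * bytes_per_param   # weights read per decode step
t = bytes_per_token / bandwidth              # seconds per token, at best
print(f"{t * 1e3:.0f} ms/token -> at most ~{1 / t:.0f} tokens/s")
```

The compute needed per token is tiny compared to the bytes moved, so the memory system, not the arithmetic units, sets the speed limit.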

LLM on HPC clusters

LLM on general computers

  • Mostly about inference;
  • Quantization;
  • CPU offloading;
  • Memory-mapped file formats;
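
Quantization in miniature: a sketch of symmetric int8 quantization, which stores weights as 8-bit integers plus one scale factor, roughly a 4x memory saving versus FP32 (real schemes quantize per channel or per block):

```python
# Symmetric int8 quantization: map floats into [-127, 127] with one scale.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]   # stored as int8
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]             # recovered at inference time

w = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize(w)
w2 = dequantize(q, scale)
```

The price is a small rounding error per weight, bounded by half the scale, which LLMs tolerate well in practice.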

HPC clusters

  • Racked computer nodes;
  • Parallel network storage;
  • InfiniBand/RoCE networking;

Alvis hardware - compute

Peak throughput in TFLOPS (Int8/Int4 in TOPS); A100 FP64 lists CUDA cores | Tensor Cores:

Data type   A100         A40      V100   T4
FP64        9.7 | 19.5   0.58     7.8    0.25
FP32        19.5         37.4     15.7   8.1
TF32        156          74.8     N/A    N/A
FP16        312          149.7    125    65
BF16        312          149.7    N/A    N/A
Int8        624          299.3    64     130
Int4        1248         598.7    N/A    260
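
The table translates directly into runtimes; for example, the rough peak time of a single 4096×4096×4096 matrix multiply (2·m·n·k FLOPs) on an A100:

```python
# Ideal (peak-throughput) time for one square matmul at two precisions,
# using the A100 numbers from the table above.

flops = 2 * 4096**3                    # multiply-adds of a 4096^3 matmul
for name, tflops in [("FP16", 312), ("FP64", 9.7)]:
    t = flops / (tflops * 1e12)        # seconds at peak throughput
    print(f"{name}: {t * 1e3:.2f} ms")
```

Real kernels land below peak, but the ~30x FP16-vs-FP64 gap is why mixed precision dominates LLM workloads.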

Alvis hardware - network & storage

  • Fast storage: WEKA file system;
  • InfiniBand: 100 Gbit/s (A100 nodes);
  • Ethernet: 25 Gbit/s (most other nodes);

Running LLMs on supercomputers

  • Most common bottleneck: memory;
  • Quantization to fit larger models into memory;
  • Parallelize the model across GPUs or nodes;
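
Model parallelism in miniature: one layer's weight matrix split across devices, each "GPU" computing its share independently. This is a pure-Python stand-in for the real sharded matrix multiplies:

```python
# Row-wise split of a weight matrix across two "devices"; the partial
# results are gathered into the same output a single device would produce.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]      # full 4x2 weight matrix
x = [1.0, 1.0]

shards = [W[:2], W[2:]]                   # rows split across 2 "GPUs"
partials = [matvec(s, x) for s in shards] # each device works independently
y = partials[0] + partials[1]             # gather the pieces
```

Each shard needs only its slice of the weights in memory, which is exactly what lets models larger than one GPU run at all; the cost is the gather step over the interconnect.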

Tools to gain information

  • Grafana (network utilization, temp disk);
  • nvtop, htop (CPU/GPU utilization, power draw);
  • NVIDIA Nsight (advanced debugging and tracing);

See details in C3SE documentation.

Summary

Take home messages

  • LLMs/neural networks benefit from massive parallelization;
  • Same recurring issue: memory-bound vs. compute-bound;
  • Optimization strategies exist for each bottleneck;
  • Be aware of the troubleshooting tools!