LLM model files commonly consist of metadata, such as the model architecture and quantization method, and the tensors themselves. The following shows the layout of the GGUF
file format.
Image from Hugging Face.
bin/pth/tf: "raw" ML library formats;
safetensors: used by Hugging Face;
ggml/gguf: developed by llama.cpp
(supports many quantization formats);
llamafile: by Mozilla, single-file
format, executable.
In some repos, you can find detailed model information for some model formats
(example).
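If you want to inspect such metadata yourself, the gguf Python package (maintained alongside llama.cpp) can read a GGUF file's key/value fields and tensor list. A minimal sketch, assuming the package is installed and "model.gguf" is a local placeholder file; attribute names follow the package at the time of writing:

```python
# Minimal sketch: inspect GGUF metadata and tensors.
# Assumes the `gguf` pip package is installed; "model.gguf" is a placeholder path.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# Key/value metadata fields (architecture, quantization, tokenizer info, ...)
for name in list(reader.fields)[:10]:
    print(name)

# Tensor names, shapes, and on-disk quantization types
for t in reader.tensors[:5]:
    print(t.name, t.shape, t.tensor_type)
```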
Models published in different formats are optimized for different usages. They
can be converted to one another; the conversion is typically implemented as model loaders
by inference engines
(example).
Raw formats used by ML libraries might be handy for re-training. Some of them
contain pickled data and so might execute arbitrary code when loaded (in contrast to "safe"
formats).
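As an illustration of the difference, here is a minimal sketch of loading a safetensors checkpoint versus a pickled PyTorch checkpoint; the file names are hypothetical placeholders:

```python
# Minimal sketch; the file names are hypothetical placeholders.
import torch
from safetensors.torch import load_file

# safetensors only deserializes raw tensor data -- no code execution on load.
tensors = load_file("model.safetensors")

# .bin/.pth checkpoints are pickled and can execute arbitrary code when loaded;
# weights_only=True restricts unpickling to tensors and plain containers.
state_dict = torch.load("pytorch_model.bin", weights_only=True)
```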
Newer formats like GGUF/safetensors are suitable for common model architectures
(different engines will support them if the architecture is known). They are
memory-mapped, which is especially useful for disk offloading.
Picking a model
While not strictly a requirement, it is usually less trouble to get a model in
your desired format. The other part of the model name usually tells you the
quantization method and the number format used in the model.
In the following, we will introduce the quantization procedure and how that
impacts performance (both in terms of speed and accuracy) and hardware
compatibility.
Quantization method;
Number format;
Hardware compatibility.
Why do we care?
ML tolerates lower numerical precision;
Quantization allows you to run larger models;
Quantization reduces expensive communication (memory traffic).
Floating point is the most common way to represent a real number in a
computer. A floating-point number uses a fixed number of bits and represents a
number in terms of a sign, an exponent, and a mantissa (significand).
Image source: Maarten Grootendorst
The mantissa determines the significant digits of an FP number, and the exponent
determines the range. Standard FP numbers typically aim to strike a balance
between accuracy and range.
Image source: Hamzael Shafie
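To make the sign/exponent/mantissa split concrete, here is a small standard-library sketch that prints the IEEE 754 bit pattern of an FP32 value:

```python
import struct

def fp32_bits(x: float) -> str:
    """Return the 32-bit IEEE 754 pattern of x as sign | exponent | mantissa."""
    bits = format(struct.unpack(">I", struct.pack(">f", x))[0], "032b")
    return f"{bits[0]} | {bits[1:9]} | {bits[9:]}"  # 1 + 8 + 23 bits

print(fp32_bits(3.14))  # the 8 exponent bits set the range, the 23 mantissa bits the precision
```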
For ML applications, it is beneficial to use a reduced-precision format with the
same number of exponent bits, as that simplifies the quantization procedure, and it
has been
claimed
that "neural networks are far more sensitive to the size of the exponent than
that of the mantissa".
As an example, converting from FP32 to BF16 is trivial, as the dynamic range
is the same; one only needs to discard mantissa bits from FP32. In contrast,
conversion from FP32 to FP16 requires scaling or clipping the numbers, with
implications for accuracy that depend on the model.
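A quick way to see the range difference, sketched here with PyTorch (assuming it is available), is to cast a value near the top of the FP32 range:

```python
import torch

x = torch.tensor([3.0e38, 1.2345e-3])   # FP32 by default
print(x.to(torch.bfloat16))             # bf16 keeps the magnitude, drops mantissa bits
print(x.to(torch.float16))              # fp16 overflows to inf (max finite value ~65504)
```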
While not detailed here, integers are also used as quantization targets. Besides
the dynamic range, note that integer values are uniformly spaced and so have a
different scale compared to FP numbers.
Image source: Maarten Grootendorst
Hardware Compatibility
Acceleration of floating-point operations also requires support from the hardware
vendor, or custom implementations of numeric kernels. Newer number formats are
not necessarily accelerated by the GPU and might get converted back to a supported
format, depending on the implementation.
Below lists some commonly used FP formats and their hardware support status:
| format | hardware accel. | note |
|---|---|---|
| fp16/32/64 | most GPUs | IEEE 754 |
| fp8 (E4M3/E5M2) | Hopper | recent IEEE |
| bf16 | most GPUs | Google's |
| tf32 | Nvidia GPUs | Nvidia |
| int4/8 | most GPUs | |
See also: data types supported by AMD ROCm.
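A quick way to check what your local GPU supports, sketched with PyTorch (assumed to be available):

```python
import torch

# Rough capability check; compute capability 8.x = Ampere, 9.x = Hopper.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("compute capability:", f"{major}.{minor}")
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```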
Rule of thumb
Google's bf16 if unsure (same range as fp32, fewer mantissa bits, good compatibility);
training is usually done in fp32/bf16;
int4/8 is good for inference (even on older GPUs).
Quantization methods
Quantization target
Quantization can be applied to weights, activations, or KV-caches. Weights are
the most common target to quantize, as they are the most memory-hungry part of
the model. It is possible to quantize only part of the model
(non-uniform quantization).
Weight/activation/mixed precision (w8a16);
KV-cache;
Non-uniform.
Mixed-precision evaluation of matrix multiplications depends on the hardware; it
might or might not support converting the tensors between precisions or doing the
tensor operations natively.
For instance, FP8 is not officially supported on Ampere GPUs (A40 and A100). While
there exist implementations that make w8a16 operations available,
quantizing the KV cache to FP8 currently needs hardware
support.
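As an illustration, vLLM exposes the KV-cache data type as an engine argument; whether an fp8 cache actually runs accelerated depends on the GPU generation. A hedged sketch (the model name is a placeholder, and the argument name may differ between vLLM versions):

```python
# Hedged sketch: requesting an fp8 KV cache in vLLM.
# The kv_cache_dtype argument follows vLLM's engine options and may change
# between versions; hardware acceleration of fp8 depends on the GPU.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
          kv_cache_dtype="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```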
(A)symmetric quantization
One important aspect when quantizing a model is the range of its parameters; the
easiest approach is to simply scale the parameters by a factor.
Image source: Maarten Grootendorst
position of zero;
range of parameters;
simple to implement;
To minimize the loss of precision, we could map the parameters according to their
max/min values, rather than the full range of the number format. There, we
need to choose whether to shift the zero point in the transform (asymmetric
quantization), which introduces complexity in the computation.
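A minimal NumPy sketch of the two mappings for int8 (illustrative only, not a production scheme):

```python
import numpy as np

def quantize_symmetric(x):
    """Symmetric: zero maps to zero; the scale is set by the largest magnitude."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale                      # dequantize with q * scale

def quantize_asymmetric(x):
    """Asymmetric: use the full [min, max] range, at the cost of a zero point."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = int(np.round(-x.min() / scale)) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point          # dequantize with (q - zero_point) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize_symmetric(x)
print(x - q.astype(np.float32) * s)      # quantization error
```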
Clipping
Image source: Maarten Grootendorst
Calibration for weight quantization
The weights of the model can simply be quantized directly, since we know their
distribution. But given some small calibration dataset, we can also improve the
accuracy by estimating how important each parameter is.
A popular way to do that is the GPTQ method, illustrated below. A calibration
dataset is used to evaluate the inverse Hessian (the sensitivity of the output
with respect to the weights). The quantization error of each weight is then
compensated for by updating the remaining weights, to minimize its impact.
Image source: Maarten Grootendorst
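One common way to apply GPTQ in practice is through the Hugging Face transformers integration. A hedged sketch; the model and calibration dataset are placeholders, and the GPTQConfig arguments may differ slightly between library versions:

```python
# Hedged sketch of post-training GPTQ via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"                      # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs while loading, using the calibration data to estimate
# sensitivities and compensate the per-weight quantization error.
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
model.save_pretrained("opt-125m-gptq-4bit")
```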
Calibration for activation quantization
To also quantize the activations, we need to estimate their range; this has to
be done by passing data through the model and collecting the minima/maxima. We
can do that either dynamically (on the fly during inference) or statically (with
a calibration set).
Image source: Maarten Grootendorst
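A minimal sketch of static calibration with forward hooks in PyTorch; the `model` and `calibration_batches` names are placeholders assumed to exist in the surrounding code:

```python
import torch

# Sketch: collect per-layer activation ranges with forward hooks (static calibration).
ranges = {}

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = ranges.get(name, (float("inf"), float("-inf")))
        ranges[name] = (min(old_lo, lo), max(old_hi, hi))
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

with torch.no_grad():
    for batch in calibration_batches:   # small calibration set
        model(batch)

for h in handles:
    h.remove()
print(ranges)                           # observed min/max per linear layer
```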
Sparsification
Image source: Nvidia technical blog
Models may also be sparsified to reduce the required computation; this is
commonly known as weight pruning. Some GPUs also support efficient
evaluation of sparse matrices if the sparsity follows a certain pattern, such as
Nvidia's 2:4 structured sparsity (example with llm-compressor).
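For intuition, a small NumPy sketch of magnitude-based 2:4 structured pruning (keep the two largest-magnitude weights in every group of four); real tooling such as llm-compressor handles this during or after training:

```python
import numpy as np

def prune_2_4(w):
    """Illustrative 2:4 structured pruning: zero the two smallest-magnitude
    weights in every contiguous group of four along the last axis."""
    flat = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]    # two smallest per group
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_4(w))    # exactly two non-zeros in every block of four
```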
Post-training quantization methods (PTQ)
Weight and activation;
Calibration/accuracy trade-off;
Not detailed: sparsification/KV cache.
Quantization aware training (QAT)
We can get higher accuracy by using the quantization-aware training (QAT)
method, where we perform the quantization/dequantization during the training
process.
Image source: Maarten Grootendorst
The first benefit is that we can actually optimize the quantization parameters
as part of the training process.
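A minimal sketch of the core QAT trick, fake quantization with a straight-through estimator, in PyTorch (the scale handling here is deliberately simplistic):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize/dequantize in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through for x, no gradient for scale

# During training, weights are fake-quantized before being used:
w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127
loss = FakeQuant.apply(w, scale).sum()
loss.backward()                    # gradients reach the full-precision weights
print(w.grad)
```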
Quantization aware training (QAT) - cont.
As we introduce the quantization error into the training process, we will
arrive at a model with a higher loss during training.
Image source: Maarten Grootendorst
The reason why it might work better is that we force the model to land in a
local minimum where it is less sensitive to perturbations of the model parameters.
So even if the original model performs worse, the quantized model can perform
better than one quantized with PTQ.
Summary
When choosing a model
Know the hardware/implementation compatibility;
Find the right model/format/quantization;
Quantize if needed;
Look up/run benchmarks.
Other useful links
Benchmarks: