LLM formats

This section is available as slides, which were presented at the workshop. This text version includes some additional notes. You can also access the slide version here.

Overview

  • Formats of LLM models
  • Formats of numbers
  • Quantization of LLMs
  • Quantization and performance

Formats of LLM models

So you want to use an LLM

What the name means

  • Llama-3.3: model (architecture)
  • 70B: size / number of parameters
  • Instruct: fine-tuning
  • AWQ-INT4: quantization
  • GGUF: model format

File-formats of LLMs

Common formats of LLMs

  • bin/pth/tf: “raw” ML library formats;
  • safetensors: used by Hugging Face (see the inspection sketch below);
  • ggml/gguf: developed by llama.cpp (supports many quantization formats);
  • llamafile: by Mozilla, a single-file, executable format.
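
As an illustration, safetensors files can be inspected without loading the whole model. A minimal sketch using the safetensors Python package, assuming PyTorch tensors (framework="pt"); the file name model.safetensors is a placeholder:

    from safetensors import safe_open

    # open the file lazily; tensors are memory-mapped, not read eagerly
    with safe_open("model.safetensors", framework="pt") as f:
        print(f.metadata())          # optional header metadata
        for name in f.keys():
            t = f.get_tensor(name)   # loads only this tensor
            print(name, tuple(t.shape), t.dtype)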

In some repos, you can find detailed model information for some model formats; see this example.

Picking a model

  • Quantization method;
  • Number format;
  • Hardware compatibility.

Formats of numbers

Why do we care?

  • ML tolerates lower numerical precision;
  • Quantization allows you to run larger models (see the arithmetic below);
  • Lower precision reduces expensive communication.
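
A back-of-the-envelope example of why this matters for model size (pure arithmetic, using the 70B parameter count from the example above):

    # approximate memory needed just for the weights of a 70B model
    params = 70e9
    for fmt, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
    # fp32: 280 GB, fp16/bf16: 140 GB, int8: 70 GB, int4: 35 GB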

Number formats - floating point


Image source: Maarten Grootendorst

Floating point formats - cont. 1


Image source: Hamzael Shafie

Floating point formats - cont. 2


Image source: Maarten Grootendorst
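
The trade-offs between these formats can also be checked directly, e.g. with PyTorch's torch.finfo; a minimal sketch:

    import torch

    # bits, machine epsilon (precision) and max value (range) per format
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        fi = torch.finfo(dtype)
        print(dtype, fi.bits, fi.eps, fi.max)
    # note: bf16 has roughly the same max as fp32 (~3.4e38) but a much
    # larger eps, i.e. the same range with less precision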

Hardware Compatibility

  format            hardware accel.   note
  ----------------  ----------------  ----------------
  fp16/fp32/fp64    most GPUs         IEEE 754
  fp8 (E4M3/E5M2)   Hopper            recent IEEE
  bf16              most GPUs         Google's format
  tf32              NVIDIA GPUs       NVIDIA's format
  int4/int8         most GPUs

See also Data types support by AMD ROCm.
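
Support can also be queried at runtime; a minimal sketch with PyTorch, assuming a CUDA build:

    import torch

    if torch.cuda.is_available():
        print(torch.cuda.get_device_name())
        print("compute capability:", torch.cuda.get_device_capability())
        print("bf16 supported:", torch.cuda.is_bf16_supported())
        # fp8 (E4M3/E5M2) kernels generally need compute capability >= 8.9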

Rule of thumb

  • Google’s bf16 if unsure (same range as fp32, fewer mantissa bits, good compatibility);
  • training is usually done in fp32/bf16;
  • int4/int8 is a good fit for inference (also on older GPUs).

Quantization methods

Quantization target

  • Weight/activation/mixed precision (e.g. w8a16; see the sketch below);
  • KV-cache;
  • Non-uniform.
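
A minimal sketch of weight-only, w8a16-style quantization (int8 weights, higher-precision activations; the weights are dequantized at matmul time, done here in fp32 for simplicity):

    import torch

    def quantize_weights_int8(w):
        # per-output-channel symmetric (absmax) scaling
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
        return q, scale

    w = torch.randn(4, 8)            # weight matrix
    x = torch.randn(8)               # activation (kept in high precision)
    q, scale = quantize_weights_int8(w)
    y = (q.float() * scale) @ x      # dequantize weights on the fly
    print((y - w @ x).abs().max())   # quantization error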

(A)symmetric quantization


Image source: Maarten Grootendorst

  • position of zero;
  • range of parameters;
  • simplicity of implementation (see the sketch below).
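
A minimal sketch contrasting the two schemes on the same data:

    import torch

    x = torch.randn(10_000) * 2 + 1.5   # data whose range is not centered at zero

    # symmetric (absmax): zero maps exactly to 0, range is [-absmax, absmax]
    s = x.abs().max() / 127
    q_sym = torch.clamp((x / s).round(), -128, 127)
    err_sym = (q_sym * s - x).abs().mean()

    # asymmetric (min-max): a zero-point shifts the grid to cover [min, max]
    s_a = (x.max() - x.min()) / 255
    zp = (-x.min() / s_a).round()
    q_asym = torch.clamp((x / s_a + zp).round(), 0, 255)
    err_asym = ((q_asym - zp) * s_a - x).abs().mean()

    print(err_sym, err_asym)  # asymmetric wins on shifted data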

Clipping


Image source: Maarten Grootendorst
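
A minimal sketch of why clipping helps when outliers stretch the range:

    import torch

    x = torch.randn(10_000)
    x[0] = 100.0  # a single outlier inflates the absmax range

    def mean_quant_error(x, absmax):
        scale = absmax / 127
        q = torch.clamp((x / scale).round(), -128, 127)
        return (q * scale - x).abs().mean()

    full = mean_quant_error(x, x.abs().max())                      # outlier-dominated
    clipped = mean_quant_error(x, torch.quantile(x.abs(), 0.999))  # clip to a percentile
    print(full, clipped)  # clipping sacrifices the outlier, gains resolution elsewhere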

Calibration for weight quantization


Image source: Maarten Grootendorst

Calibration for activation quantization


Image source: Maarten Grootendorst

The range can be estimated dynamically (on the fly) or statically.
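
A minimal sketch of the two options (the quantization step itself is as above):

    # dynamic: the scale is recomputed from the live activations at inference time
    def dynamic_scale(act):
        return act.abs().max().item() / 127

    # static: the scale is estimated once from calibration data, then frozen
    def static_scale(calibration_batches):
        absmax = max(batch.abs().max().item() for batch in calibration_batches)
        return absmax / 127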

Sparsification


Image source: NVIDIA technical blog.
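
A minimal sketch of the 2:4 structured sparsity pattern from the figure (keep the 2 largest-magnitude weights in every group of 4):

    import torch

    def sparsify_2_4(w):
        # zero out the 2 smallest-magnitude entries in each group of 4
        groups = w.reshape(-1, 4)
        keep = groups.abs().topk(2, dim=1).indices
        mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
        return (groups * mask).reshape(w.shape)

    w = torch.randn(8, 8)
    w_sparse = sparsify_2_4(w)   # exactly 50% of the entries are now zero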

Post-training quantization methods (PTQ)

  • Weight and activation quantization;
  • Calibration/accuracy trade-off;
  • Not detailed here: sparsification and KV-cache quantization.

Quantization aware training (QAT)


Image source: Maarten Grootendorst

Quantization aware training (QAT) - cont.


Image source: Maarten Grootendorst
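
A minimal sketch of the core trick behind QAT, fake quantization with a straight-through estimator (quantize-dequantize in the forward pass, identity gradient in the backward pass):

    import torch

    class FakeQuant(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            # quantize-dequantize: the forward pass sees the quantization error
            q = torch.clamp((x / scale).round(), -128, 127)
            return q * scale

        @staticmethod
        def backward(ctx, grad_output):
            # straight-through estimator: pretend rounding was the identity
            return grad_output, None

    x = torch.randn(10, requires_grad=True)
    y = FakeQuant.apply(x, x.abs().max().detach() / 127)
    y.sum().backward()   # gradients flow through the rounding step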

Summary

When choosing a model

  • Know the hardware/implementation compatibility;
  • Find the right model/format/quantization;
  • Quantize if needed;
  • Look up/run benchmarks.
