LLM formats

This section is available as slides, which were presented at the workshop. This text version includes some additional notes. You can also access the slide version here.

Overview

  • Formats of LLM models
  • Formats of numbers
  • Quantization of LLMs
  • Quantization and performance

Formats of LLM models

So you want to use an LLM

What the name means

  • Llama-3.3: model (architecture)
  • 70B: size / number of parameters
  • Instruct: fine-tuning
  • AWQ-INT4: quantization
  • GGUF: model format

File-formats of LLMs

Common formats of LLMs

  • bin/pth/tf: “raw” ML library formats;
  • safetensors: used by Hugging Face (see the inspection sketch below);
  • ggml/gguf: developed by llama.cpp (supports many quantization formats);
  • llamafile: by Mozilla, a single-file, executable format.
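
As an illustration, safetensors files can be inspected without loading the whole model. A minimal sketch using the safetensors Python package, assuming PyTorch tensors (framework="pt"); the file name model.safetensors is a placeholder:

    from safetensors import safe_open

    # open the file lazily; tensors are memory-mapped, not read eagerly
    with safe_open("model.safetensors", framework="pt") as f:
        print(f.metadata())          # optional header metadata
        for name in f.keys():
            t = f.get_tensor(name)   # loads only this tensor
            print(name, tuple(t.shape), t.dtype)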

In some repos, you can find detailed model information for some model formats; see this example.

Picking a model

  • Quantization method;
  • Number format;
  • Hardware compatibility.

Formats of numbers

Why do we care?

  • ML tolerates lower numerical precision;
  • Quantization allows you to run larger models (see the arithmetic below);
  • Lower precision reduces expensive communication.
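
A back-of-the-envelope example of why this matters for model size (pure arithmetic, using the 70B parameter count from the example above):

    # approximate memory needed just for the weights of a 70B model
    params = 70e9
    for fmt, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
    # fp32: 280 GB, fp16/bf16: 140 GB, int8: 70 GB, int4: 35 GB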

Number formats - floating point


Image source: Maarten Grootendorst

Floating point formats - cont. 1


Image source: Hamzael Shafie

Floating point formats - cont. 2


Image source: Maarten Grootendorst
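
The trade-offs between these formats can also be checked directly, e.g. with PyTorch's torch.finfo; a minimal sketch:

    import torch

    # bits, machine epsilon (precision) and max value (range) per format
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        fi = torch.finfo(dtype)
        print(dtype, fi.bits, fi.eps, fi.max)
    # note: bf16 has roughly the same max as fp32 (~3.4e38) but a much
    # larger eps, i.e. the same range with less precision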

Hardware Compatibility

  format            hardware accel.   note
  ----------------  ----------------  ----------------
  fp16/fp32/fp64    most GPUs         IEEE 754
  fp8 (E4M3/E5M2)   Hopper            recent IEEE
  bf16              most GPUs         Google's format
  tf32              NVIDIA GPUs       NVIDIA's format
  int4/int8         most GPUs

See also Data types support by AMD ROCm.
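
Support can also be queried at runtime; a minimal sketch with PyTorch, assuming a CUDA build:

    import torch

    if torch.cuda.is_available():
        print(torch.cuda.get_device_name())
        print("compute capability:", torch.cuda.get_device_capability())
        print("bf16 supported:", torch.cuda.is_bf16_supported())
        # fp8 (E4M3/E5M2) kernels generally need compute capability >= 8.9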

Rule of thumb

  • Google’s bf16 if unsure (same range as fp32, fewer mantissa bits, good compatibility);
  • training is usually done in fp32/bf16;
  • int4/int8 is a good fit for inference (also on older GPUs).

Quantization methods

Quantization target

  • Weight/activation/mixed precision (e.g. w8a16; see the sketch below);
  • KV-cache;
  • Non-uniform.
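
A minimal sketch of weight-only, w8a16-style quantization (int8 weights, higher-precision activations; the weights are dequantized at matmul time, done here in fp32 for simplicity):

    import torch

    def quantize_weights_int8(w):
        # per-output-channel symmetric (absmax) scaling
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0
        q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
        return q, scale

    w = torch.randn(4, 8)            # weight matrix
    x = torch.randn(8)               # activation (kept in high precision)
    q, scale = quantize_weights_int8(w)
    y = (q.float() * scale) @ x      # dequantize weights on the fly
    print((y - w @ x).abs().max())   # quantization error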

(A)symmetric quantization


Image source: Maarten Grootendorst

  • position of zero;
  • range of parameters;
  • simplicity of implementation (see the sketch below).
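
A minimal sketch contrasting the two schemes on the same data:

    import torch

    x = torch.randn(10_000) * 2 + 1.5   # data whose range is not centered at zero

    # symmetric (absmax): zero maps exactly to 0, range is [-absmax, absmax]
    s = x.abs().max() / 127
    q_sym = torch.clamp((x / s).round(), -128, 127)
    err_sym = (q_sym * s - x).abs().mean()

    # asymmetric (min-max): a zero-point shifts the grid to cover [min, max]
    s_a = (x.max() - x.min()) / 255
    zp = (-x.min() / s_a).round()
    q_asym = torch.clamp((x / s_a + zp).round(), 0, 255)
    err_asym = ((q_asym - zp) * s_a - x).abs().mean()

    print(err_sym, err_asym)  # asymmetric wins on shifted data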

Clipping


Image source: Maarten Grootendorst
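
A minimal sketch of why clipping helps when outliers stretch the range:

    import torch

    x = torch.randn(10_000)
    x[0] = 100.0  # a single outlier inflates the absmax range

    def mean_quant_error(x, absmax):
        scale = absmax / 127
        q = torch.clamp((x / scale).round(), -128, 127)
        return (q * scale - x).abs().mean()

    full = mean_quant_error(x, x.abs().max())                      # outlier-dominated
    clipped = mean_quant_error(x, torch.quantile(x.abs(), 0.999))  # clip to a percentile
    print(full, clipped)  # clipping sacrifices the outlier, gains resolution elsewhere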

Calibration for weight quantization


Image source: Maarten Grootendorst

Calibration for activation quantization


Image source: Maarten Grootendorst

The range can be estimated dynamically (on the fly) or statically.
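
A minimal sketch of the two options (the quantization step itself is as above):

    # dynamic: the scale is recomputed from the live activations at inference time
    def dynamic_scale(act):
        return act.abs().max().item() / 127

    # static: the scale is estimated once from calibration data, then frozen
    def static_scale(calibration_batches):
        absmax = max(batch.abs().max().item() for batch in calibration_batches)
        return absmax / 127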

Sparsification


Image source: NVIDIA technical blog.
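
A minimal sketch of the 2:4 structured sparsity pattern from the figure (keep the 2 largest-magnitude weights in every group of 4):

    import torch

    def sparsify_2_4(w):
        # zero out the 2 smallest-magnitude entries in each group of 4
        groups = w.reshape(-1, 4)
        keep = groups.abs().topk(2, dim=1).indices
        mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
        return (groups * mask).reshape(w.shape)

    w = torch.randn(8, 8)
    w_sparse = sparsify_2_4(w)   # exactly 50% of the entries are now zero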

Post-training quantization methods (PTQ)

  • Weight and activation quantization;
  • Calibration/accuracy trade-off;
  • Not detailed here: sparsification and KV-cache quantization.

Quantization aware training (QAT)


Image source: Maarten Grootendorst

Quantization aware training (QAT) - cont.


Image source: Maarten Grootendorst
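
A minimal sketch of the core trick behind QAT, fake quantization with a straight-through estimator (quantize-dequantize in the forward pass, identity gradient in the backward pass):

    import torch

    class FakeQuant(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            # quantize-dequantize: the forward pass sees the quantization error
            q = torch.clamp((x / scale).round(), -128, 127)
            return q * scale

        @staticmethod
        def backward(ctx, grad_output):
            # straight-through estimator: pretend rounding was the identity
            return grad_output, None

    x = torch.randn(10, requires_grad=True)
    y = FakeQuant.apply(x, x.abs().max().detach() / 127)
    y.sum().backward()   # gradients flow through the rounding step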

Summary

When choosing a model

  • Know the hardware/implementation compatibility;
  • Find the right model/format/quantization;
  • Quantize if needed;
  • Look up/run benchmarks.
