This section is available as slides, which were presented at the workshop. This text version includes some additional notes. You can also access the slide version here.

A model name such as Llama-3.3-70B-Instruct-AWQ-INT4 breaks down as follows:

- Llama-3.3: model (architecture)
- 70B: size / number of parameters
- Instruct: fine-tuning
- AWQ-INT4: quantization
- GGUF: model format

In some repos, you can find detailed model information for some model formats (example).
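As a minimal sketch of how such a checkpoint is used in practice (the repo ID below is a hypothetical example following the naming scheme above, not a specific repository from this text), a pre-quantized model can typically be loaded with Hugging Face transformers, which reads the quantization config stored alongside the weights:

```python
# Minimal sketch: loading a pre-quantized checkpoint with Hugging Face
# transformers. The repo ID is an assumption based on the naming scheme
# explained above; replace it with an actual repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/Llama-3.3-70B-Instruct-AWQ-INT4"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# For a pre-quantized model, the quantization settings (here AWQ, INT4)
# are read from the repo's config, so no extra arguments are needed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```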

Image source: Maarten Grootendorst
Image source: Hamzael Shafie

Image source: Maarten Grootendorst
| data type | hardware accel. | note |
|---|---|---|
| fp16/32/64 | most GPUs | IEEE 754 |
| fp8 (E4M3/E5M2) | Hopper | recent IEEE |
| bf16 | most GPUs | Google's brain float |
| tf32 | NVIDIA GPUs | NVIDIA |
| int4/8 | most GPUs | |
See also Data types support by AMD ROCm.
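Most of these data types can be inspected directly in a framework. Here is a minimal sketch in PyTorch (assuming PyTorch ≥ 2.1 for the fp8 dtypes; hardware acceleration for fp8 further requires a Hopper-class GPU):

```python
import torch

# IEEE 754 half / single / double precision
x16 = torch.tensor([1.0], dtype=torch.float16)
x32 = torch.tensor([1.0], dtype=torch.float32)
x64 = torch.tensor([1.0], dtype=torch.float64)

# Google's brain float: same 8-bit exponent as fp32, fewer mantissa bits
xbf = torch.tensor([1.0], dtype=torch.bfloat16)

# fp8 variants (storage dtypes; compute acceleration needs Hopper-class GPUs)
xe4m3 = torch.tensor([1.0]).to(torch.float8_e4m3fn)
xe5m2 = torch.tensor([1.0]).to(torch.float8_e5m2)

# tf32 is not a storage dtype: it is a matmul mode on NVIDIA Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True

# integer dtypes serve as storage for quantized weights
xi8 = torch.tensor([1], dtype=torch.int8)
```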


Image source: Maarten Grootendorst

Image source: Maarten Grootendorst

Image source: Maarten Grootendorst

Image source: Maarten Grootendorst
The range can be estimated dynamically (on the fly) or statically.
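To make the distinction concrete, here is a minimal absmax-quantization sketch (symmetric int8; the function names are my own): in static quantization the scale is computed once from calibration data and reused, while in dynamic quantization the same computation runs on the fly for each activation tensor.

```python
import torch

def absmax_quantize(x: torch.Tensor):
    """Symmetric int8 quantization. The range is estimated from the
    tensor's maximum absolute value: computed once from calibration
    data (static) or per tensor at inference time (dynamic)."""
    scale = x.abs().max() / 127  # map [-max|x|, +max|x|] onto [-127, 127]
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

x = torch.randn(5)
q, scale = absmax_quantize(x)
print(x)
print(dequantize(q, scale))  # close to x, up to rounding error
```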

Image source: Nvidia technical blog

Image source: Maarten Grootendorst

Image source: Maarten Grootendorst
Benchmarks: