This section is available as slides, which were presented at the workshop. This text version includes some additional notes. You can also access the slide version here.


Image source: Introduction to Attention Mechanism






Peak throughput per data type, in TFLOPS (TOPS for the integer types); the A100 additionally reaches 19.5 TFLOPS FP64 with its FP64 Tensor Cores:

| Data type | A100 | A40 | V100 | T4 |
|---|---|---|---|---|
| FP64 | 9.7 | 0.58 | 7.8 | 0.25 |
| FP32 | 19.5 | 37.4 | 15.7 | 8.1 |
| TF32 | 156 | 74.8 | N/A | N/A |
| FP16 | 312 | 149.7 | 125 | 65 |
| BF16 | 312 | 149.7 | N/A | N/A |
| Int8 | 624 | 299.3 | 64 | 130 |
| Int4 | 1248 | 598.7 | N/A | 260 |
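The trade-off behind these formats can be seen without any GPU at all: FP16 spends its bits on the mantissa (precision, but a small range topping out at 65504), while BF16 keeps FP32's 8-bit exponent (FP32-like range, but only 7 mantissa bits). A minimal sketch using only the Python standard library (not part of the workshop material; `fp16_round` and `bf16_trunc` are illustrative helper names):

```python
import struct

def fp16_round(x: float) -> float:
    """Round a Python float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def bf16_trunc(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding.

    Real hardware rounds to nearest; truncation is close enough to
    illustrate the format's behaviour.
    """
    b = struct.pack('<f', x)                       # 4 bytes, little-endian
    return struct.unpack('<f', b'\x00\x00' + b[2:])[0]

# FP16 has a 10-bit mantissa: integers above 2048 are no longer exact.
print(fp16_round(2049.0))   # 2048.0
# FP16's largest finite value is 65504 ...
print(fp16_round(65504.0))  # 65504.0
# ... while BF16 shares FP32's ~3.4e38 range, at much coarser precision.
print(bf16_trunc(1e38))
```

This is why BF16 is popular for training: gradients rarely need many significant digits, but they do overflow FP16's range unless loss scaling is used.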


Find details in the C3SE documentation.
If you are surprised that models work better with more variables, you are not alone; see double descent.↩︎
arXiv:2106.09685 [cs.CL]↩︎