Quantization¶
Outline¶
- Quantization Techniques
- Linear Quantization
- GPTQ Quantization
- AWQ Quantization
- Other Methods
- Summary
- Reference
Note: Some examples may take a lot of VRAM. You can restart the kernel if you hit an OOM error.
Quantization Techniques¶
- Post-training quantization (PTQ):
  - Post-training dynamic quantization: the range for each activation is computed on the fly at runtime.
  - Post-training static quantization: the range for each activation is computed in advance at quantization time, typically by passing representative data through the model and recording the activation values.
- Quantization-aware training (QAT): the range for each activation is computed at training time. Quantization error is simulated during training so that the model learns to compensate for it (see the fake-quantization sketch below).
Reference: https://huggingface.co/docs/optimum/concept_guides/quantization#calibration
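To make the QAT idea concrete, here is a minimal fake-quantization sketch in plain PyTorch (the helper name `fake_quantize` is made up for illustration, not a library API): the forward pass quantizes and immediately dequantizes the weights so that the loss sees the quantization error, while a straight-through estimator lets gradients flow back to the full-precision weights.

import torch

def fake_quantize(w, num_bits=8):
    # Simulate symmetric integer quantization in the forward pass (quantize + dequantize)
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward treats the rounding as identity
    return w + (w_q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
loss = (fake_quantize(w) ** 2).sum()  # the loss is computed on weights that carry quantization error
loss.backward()                       # gradients still reach the full-precision w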
Linear quantization¶
Image source: Maarten Grootendorst
- Dequantization: $x = S \cdot (x_q - Z)$, where $S$ is the scale and $Z$ is the zero point
- When $Z = 0$: symmetric quantization
- It can be applied per tensor or per channel (a minimal sketch follows below)
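A minimal sketch of this mapping in plain PyTorch (hypothetical helper names, not Quanto internals): derive the scale $S$ and zero point $Z$ from the tensor's range, round to int8, and dequantize with $x = S \cdot (x_q - Z)$.

import torch

def affine_quantize_int8(x):
    # Map [x.min(), x.max()] onto the int8 range [-128, 127]
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    # x = S * (x_q - Z)
    return scale * (x_q.float() - zero_point)

x = torch.randn(4, 4)
x_q, scale, zp = affine_quantize_int8(x)
print((x - dequantize(x_q, scale, zp)).abs().max())  # small reconstruction error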
Affine quantization in Quanto (Int8)¶
In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import QuantizedModelForCausalLM, qint8
# https://www.geeksforgeeks.org/nlp/perplexity-for-llm-evaluation/
def compute_perplexity_for_batch(model, tokenizer, input_texts):
    inputs = tokenizer(
        input_texts, return_tensors="pt", padding=True, truncation=True
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)
    target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)
    target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)
    negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)
    perplexities = torch.exp(negative_log_likelihood)
    mean_perplexity_score = torch.mean(perplexities)
    return {
        "perplexities": perplexities.tolist(),
        "mean_perplexity": mean_perplexity_score.item()
    }
In [2]:
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
Parameter containing:
tensor([[-0.0179, 0.0066, 0.0247, ..., -0.0087, -0.0117, 0.0201],
[ 0.0122, 0.0593, 0.0552, ..., -0.0332, -0.0154, 0.0108],
[ 0.0178, 0.0155, 0.0344, ..., -0.0386, -0.0386, -0.0276],
...,
[ 0.0298, 0.0352, 0.0713, ..., -0.0718, -0.0265, -0.0287],
[ 0.0226, -0.0248, 0.0352, ..., -0.0120, -0.0287, -0.0148],
[-0.0258, -0.0537, -0.0131, ..., 0.0542, 0.0096, -0.0028]],
requires_grad=True)
In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
example_texts = [
    "Once upon a time, there was a brave knight.",
    "In a galaxy far, far away, a new adventure began."
]
# Compute perplexity scores for the batch of input texts
results = compute_perplexity_for_batch(model, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")
Perplexity scores for each text: [45.347049713134766, 16.073394775390625]
In [4]:
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint8, exclude='lm_head')
print(qmodel)
print(qmodel.model.layers[0].self_attn.q_proj.weight)
qmodel.save_pretrained('output/official/QLlama-3.2-1B')
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
<class 'optimum.quanto.tensor.weights.qbytes.WeightQBytesTensor'>(tensor([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], dtype=torch.int8), scale=tensor([[0.0004],
[0.0008],
[0.0011],
...,
[0.0018],
[0.0007],
[0.0018]]), dtype=torch.float32)
In [5]:
results = compute_perplexity_for_batch(qmodel, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")
Perplexity scores for each text: [45.38690948486328, 16.23394012451172]
Quanto integration in Transformers¶
In [6]:
from transformers import AutoModelForCausalLM, QuantoConfig
quantization_config = QuantoConfig(weights="int8", activations=None)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
# A Quanto-quantized model cannot be serialized through the transformers integration, so it cannot be saved here
# model.save_pretrained("output/transformers/QLlama-3.2-1B")
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
<class 'optimum.quanto.tensor.weights.qbytes.WeightQBytesTensor'>(tensor([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], dtype=torch.int8), scale=tensor([[0.0004],
[0.0008],
[0.0011],
...,
[0.0018],
[0.0007],
[0.0018]]), dtype=torch.float32)
Activation quantization / Calibration in Quanto¶
In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8, Calibration, quantization_map
from safetensors.torch import save_file
from datasets import load_dataset
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))

quantize(model, weights=qint8, activations=qint8)
with torch.no_grad(), Calibration(momentum=0.9):
    model.eval()
    for batch in calibration_dataset.iter(batch_size=2):
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True)
        input_ids = inputs.input_ids.to(model.device)
        attention_mask = inputs.attention_mask.to(model.device)
        output = model(input_ids, attention_mask=attention_mask)
        # good habit: free per-batch tensors and cached GPU memory
        del input_ids, attention_mask
        torch.cuda.empty_cache()

print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): QLinear(in_features=2048, out_features=128256, bias=False)
)
In [2]:
import os
import json

os.makedirs("output/calibration/QLlama-3.2-1B", exist_ok=True)
# Freeze integer weights
freeze(model)
# Serialize quantized model
save_file(model.state_dict(), 'output/calibration/QLlama-3.2-1B/model.safetensors')
# Store the quantized model's quantization map
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)
In [3]:
from safetensors.torch import load_file
from optimum.quanto import requantize
from transformers import AutoModelForCausalLM, AutoConfig
state_dict = load_file('output/calibration/QLlama-3.2-1B/model.safetensors')
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'r') as f:
    quantization_map = json.load(f)
# Create an empty model from your modeling code and requantize it
config = AutoConfig.from_pretrained("/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/config.json")
model = AutoModelForCausalLM.from_config(config)
requantize(model, state_dict, quantization_map, device=torch.device('cuda'))
In [4]:
print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): QLinear(in_features=2048, out_features=128256, bias=False)
)
Outlier problem¶
- Quanto simply uses `absmax()` to calculate the scale
- An outlier would compress most other values to 0 (see the example below)
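A toy numeric example of the issue, assuming absmax scaling to int8:

import torch

x = torch.tensor([0.1, -0.2, 0.3, 0.05, 50.0])     # one large outlier
scale = x.abs().max() / 127                          # absmax scale is dominated by the outlier
x_q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
print(x_q)                  # tensor([  0,  -1,   1,   0, 127], dtype=torch.int8)
print(x_q.float() * scale)  # 0.1 and 0.05 are both reconstructed as 0.0: their information is lost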
LLM.int8() in Bitsandbytes¶
- Saves outliers in a separate higher-precision tensor to preserve their information (see the sketch after the figure below)
- The model is quantized on the fly, without first loading it in full precision
Image source: Dettmers+2022
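A simplified sketch of this mixed-precision decomposition in plain PyTorch (the function name, threshold value, and the simulated int8 matmul are illustrative assumptions, not the bitsandbytes implementation): outlier feature columns stay in floating point, everything else goes through vector-wise absmax int8 quantization.

import torch

def llm_int8_matmul_sketch(X, W, threshold=6.0):
    # X: (tokens, in_features) activations, W: (out_features, in_features) weights
    # Feature dimensions whose activation magnitude exceeds the threshold are treated as outliers
    outlier = X.abs().amax(dim=0) > threshold
    # 1) Outlier features: keep the matmul in the original floating-point precision
    y_out = X[:, outlier] @ W[:, outlier].T
    # 2) Regular features: vector-wise absmax int8 quantization, int8 matmul (simulated in fp here)
    Xr, Wr = X[:, ~outlier], W[:, ~outlier]
    sx = Xr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127   # per-token scale
    sw = Wr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127   # per-output-channel scale
    Xq = torch.round(Xr / sx).clamp(-127, 127)
    Wq = torch.round(Wr / sw).clamp(-127, 127)
    y_reg = (Xq @ Wq.T) * (sx * sw.T)                               # dequantize after the matmul
    return y_out + y_reg

X, W = torch.randn(4, 64), torch.randn(16, 64)
print((llm_int8_matmul_sketch(X, W) - X @ W.T).abs().max())         # stays close to the fp result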
In [1]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype="auto"
)
print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
(v_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
(o_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear8bitLt(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
In [2]:
for name, param in model.named_parameters():
    if hasattr(param, "SCB"):
        print(name)
        print(param)
        print(param.SCB)
        break

print(model.get_memory_footprint() / 1e9)
model.layers.0.self_attn.q_proj.weight
Parameter containing:
Parameter(Int8Params([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], device='cuda:0',
dtype=torch.int8))
tensor([0.0515, 0.1079, 0.1436, ..., 0.2256, 0.0898, 0.2305], device='cuda:0')
1.4985504
GPTQ (Generative Pre-trained Transformer Quantizer) Quantization¶
- Processes weights sequentially, one column at a time
- Compensates for the error induced by the current step by updating the not-yet-quantized weights (a simplified sketch follows below)
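A heavily simplified sketch of the per-column update in plain PyTorch (no grouping, blocking, or activation ordering; `gptq_sketch` is an illustrative name, not the gptqmodel API): quantize one weight column at a time and spread its quantization error over the remaining columns, using the Cholesky factor of the inverse Hessian built from calibration inputs.

import torch

def rtn_quantize_column(col, scale):
    # Plain round-to-nearest symmetric int4 quantization of one weight column
    return torch.clamp(torch.round(col / scale), -8, 7) * scale

def gptq_sketch(W, X, damp=0.01):
    # W: (out_features, in_features) weights, X: (n_samples, in_features) calibration inputs
    W = W.clone().float()
    Q = torch.zeros_like(W)
    scale = W.abs().amax(dim=1, keepdim=True) / 7                 # per-row symmetric scale
    H = X.T @ X + damp * torch.eye(X.shape[1])                    # damped proxy Hessian
    U = torch.linalg.cholesky(torch.linalg.inv(H)).T              # upper Cholesky factor of H^-1
    for i in range(W.shape[1]):                                   # process weight columns sequentially
        q = rtn_quantize_column(W[:, i], scale[:, 0])
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # compensate: push this column's quantization error onto the not-yet-quantized columns
        W[:, i:] -= err.unsqueeze(1) * U[i, i:].unsqueeze(0)
    return Q

W, X = torch.randn(8, 16), torch.randn(128, 16)
Q = gptq_sketch(W, X)  # 4-bit weights whose error is compensated w.r.t. the calibration inputs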
In [1]:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoModelForCausalLM
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4)
model = GPTQModel.load(model_name, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2);
model.save("output/official/QLlama-3.2-1B")
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
INFO Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=128004 (token='<|finetune_right_pad_id|>').
INFO Model: Loaded `generation_config`: GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": [ 128001, 128008, 128009 ], "temperature": 0.6, "top_p": 0.9 }
INFO Kernel: loaded -> `[]`
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO Process: progress logs for `gptq` will be streamed to file: `gptq_log_preexperiment_time_11_19_2025_18h_56m_52s.log`

| process | layer | module | loss | samples | damp | time | fwd_time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gptq | 0 | self_attn.k_proj | 0.30117315 | 1024 | 0.01000 | 1.217 | 3.466 |
| gptq | 0 | self_attn.v_proj | 0.00808881 | 1024 | 0.01000 | 0.450 | 3.466 |
| gptq | 0 | self_attn.q_proj | 0.61798751 | 1024 | 0.01000 | 0.454 | 3.466 |
| gptq | 0 | self_attn.o_proj | 0.00077137 | 1024 | 0.01000 | 0.456 | 2.546 |
| gptq | 0 | mlp.up_proj | 0.51245892 | 1024 | 0.01000 | 0.463 | 2.947 |
| gptq | 0 | mlp.gate_proj | 0.64624822 | 1024 | 0.01000 | 0.460 | 2.947 |
| gptq | 0 | mlp.down_proj | 0.00407354 | 1024 | 0.01000 | 2.052 | 7.199 |

(equivalent per-module loss tables are logged for layers 1-15; the per-module losses grow slowly with depth, e.g. mlp.gate_proj rises from 0.646 at layer 0 to 5.708 at layer 15)

INFO Packing model...
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO Packing Kernel: Auto-selection: adding candidate `TorchQuantLinear`
INFO Kernel: candidates -> `[TritonV2QuantLinear, TorchQuantLinear]`
INFO Kernel: selected -> `TritonV2QuantLinear`.
INFO Model packed.
0% INFO Format: Converting GPTQ v2 to v1 INFO Saved Quantize Config: { "bits": 4, "group_size": 128, "desc_act": true, "sym": true, "lm_head": false, "quant_method": "gptq", "checkpoint_format": "gptq", "pack_dtype": "int32", "meta": { "quantizer": [ "gptqmodel:2.2.0" ], "uri": "https://github.com/modelcloud/gptqmodel", "damp_percent": 0.01, "damp_auto_increment": 0.0025, "static_groups": false, "true_sequential": true, "mse": 0.0 } } Files in directory: quant_log.csv special_tokens_map.json generation_config.json quanto_qmap.json quantize_config.json tokenizer.json tokenizer_config.json README.md model.safetensors chat_template.jinja config.json Content of saved `generation_config.json`: { "bos_token_id": 128000, "do_sample": true, "eos_token_id": [ 128001, 128008, 128009 ], "temperature": 0.6, "top_p": 0.9, "transformers_version": "4.52.4" } Content of saved `config.json`: { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "head_dim": 64, "hidden_act": "silu", "hidden_size": 2048, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 16, "num_key_value_heads": 8, "pretraining_tp": 1, "quantization_config": { "bits": 4, "checkpoint_format": "gptq", "desc_act": true, "group_size": 128, "lm_head": false, "meta": { "damp_auto_increment": 0.0025, "damp_percent": 0.01, "mse": 0.0, "quantizer": [ "gptqmodel:2.2.0" ], "static_groups": false, "true_sequential": true, "uri": "https://github.com/modelcloud/gptqmodel" }, "pack_dtype": "int32", "quant_method": "gptq", "sym": true }, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 32.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "transformers_version": "4.52.4", "use_cache": true, "vocab_size": 128256 } INFO Pre-Quantized model size: 2357.14MB, 2.30GB INFO Quantized model size: 984.55MB, 0.96GB INFO Size difference: 1372.59MB, 1.34GB - 58.23%
GPTQ integration in Transformers¶
In [1]:
Copied!
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
tokenizer=AutoTokenizer.from_pretrained(model_name)
gptq_config=GPTQConfig(
bits=4,
dataset="c4", # optimum will download 'en/c4-train.00000-of-01024.json.gz'
tokenizer=tokenizer,
)
quantized_model=AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
quantization_config=gptq_config
)
print(quantized_model)
print(quantized_model.get_memory_footprint())
quantized_model.save_pretrained("output/transformers/QLlama-3.2-1B")
tokenizer.save_pretrained("output/transformers/QLlama-3.2-1B")
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
tokenizer=AutoTokenizer.from_pretrained(model_name)
gptq_config=GPTQConfig(
bits=4,
dataset="c4", # optimum will download 'en/c4-train.00000-of-01024.json.gz'
tokenizer=tokenizer,
)
quantized_model=AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
quantization_config=gptq_config
)
print(quantized_model)
print(quantized_model.get_memory_footprint())
quantized_model.save_pretrained("output/transformers/QLlama-3.2-1B")
tokenizer.save_pretrained("output/transformers/QLlama-3.2-1B")
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving. INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Quantizing model.layers blocks : 0%| | 0/16 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 1/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 1/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 2/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 2/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 3/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 3/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 4/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 4/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 5/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 5/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 6/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 6/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 7/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 7/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 8/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 8/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 9/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 9/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 10/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 10/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 11/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 11/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 12/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 12/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 13/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 13/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 14/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 14/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 15/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 15/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 16/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 16/16... INFO:optimum.gptq.quantizer:Packing model...
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`. INFO:optimum.gptq.quantizer:model.layers.0.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.o_proj 
INFO:optimum.gptq.quantizer:model.layers.8.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.up_proj INFO:optimum.gptq.quantizer:Model packed.
INFO Optimize: `TritonV2QuantLinear` compilation triggered.
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(k_proj): TritonV2QuantLinear()
(o_proj): TritonV2QuantLinear()
(q_proj): TritonV2QuantLinear()
(v_proj): TritonV2QuantLinear()
)
(mlp): LlamaMLP(
(act_fn): SiLU()
(down_proj): TritonV2QuantLinear()
(gate_proj): TritonV2QuantLinear()
(up_proj): TritonV2QuantLinear()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
1032327296
Out[1]:
('output/transformers/QLlama-3.2-1B/tokenizer_config.json',
'output/transformers/QLlama-3.2-1B/special_tokens_map.json',
'output/transformers/QLlama-3.2-1B/chat_template.jinja',
'output/transformers/QLlama-3.2-1B/tokenizer.json')
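The reported memory footprint (1,032,327,296 bytes, roughly 0.96 GB) is consistent with the linear weights being packed to 4 bits. Once saved, the checkpoint can be reloaded without repeating calibration, since the quantization settings are stored in `config.json`. The snippet below is a minimal sketch (reusing the output path from the cell above; the prompt and generation settings are only illustrative) of reloading and prompting the quantized model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_dir = "output/transformers/QLlama-3.2-1B"

# The quantization_config saved in config.json tells Transformers how to
# rebuild the GPTQ layers, so no GPTQConfig is needed at load time.
reloaded = AutoModelForCausalLM.from_pretrained(quant_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_dir)

inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(reloaded.device)
output_ids = reloaded.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```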
AWQ: Activation-aware Weight Quantization¶
- Hypothesizes that activation magnitudes determine the importance of each weight
- Uses calibration data to identify salient channels
- Calculates per-channel scaling factors to reduce quantization error (see the sketch below)
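The toy sketch below (an illustration of the idea, not the llmcompressor implementation) shows why per-channel scaling helps: scaling a salient input channel up before 4-bit rounding, and folding the inverse scale into the activations, leaves the full-precision product unchanged while letting that channel use the quantization grid more finely. The channel index, scale value, and sizes are made up for the example; on this setup the scaled error is typically lower than plain round-to-nearest.

```python
import torch

def quantize_4bit(w):
    # symmetric per-output-channel 4-bit round-to-nearest
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return (w / scale).round().clamp(-8, 7) * scale

torch.manual_seed(0)
w = torch.randn(64, 64) * 0.02   # toy weight matrix (out_features x in_features)
x = torch.randn(16, 64)          # toy activations
x[:, 0] *= 50.0                  # input channel 0 carries unusually large activations ("salient")

s = torch.ones(64)
s[0] = 4.0                       # in real AWQ the scale is chosen from activation statistics

y_ref = x @ w.T
err_rtn = (x @ quantize_4bit(w).T - y_ref).pow(2).mean()
# W * s scales the salient input column up; x / s folds the inverse into the
# activations, so the full-precision product is unchanged.
err_awq = ((x / s) @ quantize_4bit(w * s).T - y_ref).pow(2).mean()
print(f"round-to-nearest error: {err_rtn:.6f} | AWQ-style scaled error: {err_awq:.6f}")
```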
AWQ in llmcompressor¶
- autoawq has been archived; llmcompressor, developed by the vLLM team, has taken over its functionality
- llmcompressor supports several quantization methods (a sketch of an alternative recipe follows this list):
- Simple PTQ
- GPTQ
- AWQ
- SmoothQuant
- SparseGPT
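The other methods in the list are exposed through the same recipe interface. The snippet below is a hedged sketch of an alternative W8A8 recipe that pairs SmoothQuant with GPTQ instead of AWQ; the modifier names and arguments follow the llmcompressor examples, but check them against the installed version.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes weights and activations to 8 bit; lm_head stays in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Pass this recipe to oneshot(model=..., dataset=..., recipe=recipe, ...)
# exactly as in the AWQ cell below.
```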
In [1]:
Copied!
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))
# Configure the quantization algorithm to run.
recipe = [
AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
# Apply algorithms.
oneshot(
model=model,
dataset=calibration_dataset,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
output_dir="output/llmcompressor/QLlama-3.2-1B"
)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))
# Configure the quantization algorithm to run.
recipe = [
AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
# Apply algorithms.
oneshot(
model=model,
dataset=calibration_dataset,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
output_dir="output/llmcompressor/QLlama-3.2-1B"
)
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] /opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
Tokenizing: 0%| | 0/1024 [00:00<?, ? examples/s]
2025-11-20T04:34:49.610464+0100 | reset | INFO - Compression lifecycle reset 2025-11-20T04:34:49.613895+0100 | from_modifiers | INFO - Creating recipe from modifiers 2025-11-20T04:34:49.693089+0100 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...
Resolving mapping 1/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1068.75it/s] Resolving mapping 2/4 (15 skipped): 100%|██████████| 16/16 [00:00<00:00, 2230.49it/s] Resolving mapping 3/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1327.02it/s] Resolving mapping 4/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 2238.45it/s]
2025-11-20T04:34:49.757582+0100 | initialize | INFO - Compression lifecycle initialized for 1 modifiers 2025-11-20T04:34:49.758165+0100 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`
Preparing cache: 100%|██████████| 256/256 [00:00<00:00, 1555.82it/s] (1/17): Calibrating: 100%|██████████| 256/256 [00:02<00:00, 119.49it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (1/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 351.90it/s] (2/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 136.26it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.86s/it] (2/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 466.73it/s] (3/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 168.07it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.85s/it] (3/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 484.16it/s] (4/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 215.13it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.86s/it] (4/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 407.05it/s] (5/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 196.82it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (5/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 432.69it/s] (6/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 169.70it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (6/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 479.21it/s] (7/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 214.30it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (7/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 438.11it/s] (8/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 170.15it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (8/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 445.58it/s] (9/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 206.62it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (9/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 425.35it/s] (10/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 180.31it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (10/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 477.98it/s] (11/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 181.03it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (11/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 424.24it/s] (12/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 181.42it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.89s/it] (12/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 437.31it/s] (13/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 199.29it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (13/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 434.82it/s] (14/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 202.33it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.89s/it] (14/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 463.45it/s] (15/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 193.22it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.90s/it] (15/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 451.70it/s] (16/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 176.80it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (16/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 400.77it/s] (17/17): Calibrating: 100%|██████████| 256/256 [00:00<00:00, 283.77it/s] Smoothing: 0it [00:00, ?it/s] (17/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 377.33it/s] Smoothing: 0it [00:00, ?it/s] Calibrating weights: 100%|██████████| 327/327 [00:01<00:00, 
255.26it/s]
2025-11-20T04:37:45.957744+0100 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-11-20T04:37:46.445192+0100 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 215it [00:04, 47.70it/s]
Out[1]:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
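The saved directory uses the compressed-tensors format, which is what vLLM expects. A minimal sketch of serving it with vLLM (assuming vLLM is installed in the environment; the prompt and sampling settings are illustrative) could look like this; Transformers can also reload the checkpoint directly with `AutoModelForCausalLM.from_pretrained`.

```python
from vllm import LLM, SamplingParams

# Load the compressed-tensors checkpoint produced by llmcompressor above.
llm = LLM(model="output/llmcompressor/QLlama-3.2-1B")
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=64)
for out in llm.generate(["Explain activation-aware weight quantization in one sentence."], params):
    print(out.outputs[0].text)
```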
Other quantization methods supported in Transformers¶
Summary¶
We have introduced
- Using optimum-quanto to linearly quantize llama3.2-1b in 8-bit
- Using bitsandbytes to linearly quantize llama3.2-1b in 8-bit and handle outliers
- Using GPTQModel to quantize llama3.2-1b with the GPTQ method
- Using llmcompressor to quantize llama3.2-1b with the AWQ method
- Saving quantized models and reloading them
Reference¶
- https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization/notebook
- https://www.kaggle.com/code/aisuko/quantization-methods
- https://www.kaggle.com/code/aisuko/quantization-with-gptq
- https://apxml.com/courses/practical-llm-quantization
- https://github.com/huggingface/optimum-quanto
- https://github.com/bitsandbytes-foundation/bitsandbytes
- https://github.com/ModelCloud/GPTQModel
- https://huggingface.co/docs/transformers/quantization