Quantization¶
Outline¶
- Quantization Techniques
- Linear Quantization
- GPTQ Quantization
- AWQ Quantization
- Other Methods
- Summary
- Reference
Note: Some examples may take a lot of VRAM. You can restart the kernel if you hit an OOM error.
Quantization Techniques¶
- Post-training quantization (PTQ):
  - Post-training dynamic quantization: the range for each activation is computed on the fly at runtime.
  - Post-training static quantization: the range for each activation is computed in advance at quantization time, typically by passing representative data through the model and recording the activation values.
- Quantization-aware training (QAT): the range for each activation is computed at training time. Quantization error is simulated during training so that the model learns to compensate for it (see the fake-quantization sketch below).
Reference: https://huggingface.co/docs/optimum/concept_guides/quantization#calibration
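To make the QAT idea concrete, here is a minimal fake-quantization sketch in plain PyTorch (the helper name `fake_quantize` is made up for illustration, not a library API): the forward pass quantizes and immediately dequantizes the weights so that the loss sees the quantization error, while a straight-through estimator lets gradients flow back to the full-precision weights.

import torch

def fake_quantize(w, num_bits=8):
    # Simulate symmetric integer quantization in the forward pass (quantize + dequantize)
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses w_q, backward treats the rounding as identity
    return w + (w_q - w).detach()

w = torch.randn(8, 8, requires_grad=True)
loss = (fake_quantize(w) ** 2).sum()  # the loss is computed on weights that carry quantization error
loss.backward()                       # gradients still reach the full-precision w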
Linear quantization¶
Image source: Maarten Grootendorst
- Dequantization: $x = S \cdot (x_q - Z)$, where $S$ is the scale and $Z$ is the zero point
- When $Z = 0$: symmetric quantization
- It can be applied per tensor or per channel (a minimal sketch follows below)
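A minimal sketch of this mapping in plain PyTorch (hypothetical helper names, not Quanto internals): derive the scale $S$ and zero point $Z$ from the tensor's range, round to int8, and dequantize with $x = S \cdot (x_q - Z)$.

import torch

def affine_quantize_int8(x):
    # Map [x.min(), x.max()] onto the int8 range [-128, 127]
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    # x = S * (x_q - Z)
    return scale * (x_q.float() - zero_point)

x = torch.randn(4, 4)
x_q, scale, zp = affine_quantize_int8(x)
print((x - dequantize(x_q, scale, zp)).abs().max())  # small reconstruction error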
Affine quantization in Quanto (Int8)¶
In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import QuantizedModelForCausalLM, qint8
# https://www.geeksforgeeks.org/nlp/perplexity-for-llm-evaluation/
def compute_perplexity_for_batch(model, tokenizer, input_texts):
    inputs = tokenizer(
        input_texts, return_tensors="pt", padding=True, truncation=True
    )
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)
    target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)
    target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)
    negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)
    perplexities = torch.exp(negative_log_likelihood)
    mean_perplexity_score = torch.mean(perplexities)
    return {
        "perplexities": perplexities.tolist(),
        "mean_perplexity": mean_perplexity_score.item()
    }
In [2]:
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
Parameter containing:
tensor([[-0.0179, 0.0066, 0.0247, ..., -0.0087, -0.0117, 0.0201],
[ 0.0122, 0.0593, 0.0552, ..., -0.0332, -0.0154, 0.0108],
[ 0.0178, 0.0155, 0.0344, ..., -0.0386, -0.0386, -0.0276],
...,
[ 0.0298, 0.0352, 0.0713, ..., -0.0718, -0.0265, -0.0287],
[ 0.0226, -0.0248, 0.0352, ..., -0.0120, -0.0287, -0.0148],
[-0.0258, -0.0537, -0.0131, ..., 0.0542, 0.0096, -0.0028]],
requires_grad=True)
In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
example_texts = [
    "Once upon a time, there was a brave knight.",
    "In a galaxy far, far away, a new adventure began."
]
# Compute perplexity scores for the batch of input texts
results = compute_perplexity_for_batch(model, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")
Perplexity scores for each text: [45.347049713134766, 16.073394775390625]
In [4]:
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint8, exclude='lm_head')
print(qmodel)
print(qmodel.model.layers[0].self_attn.q_proj.weight)
qmodel.save_pretrained('output/official/QLlama-3.2-1B')
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
<class 'optimum.quanto.tensor.weights.qbytes.WeightQBytesTensor'>(tensor([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], dtype=torch.int8), scale=tensor([[0.0004],
[0.0008],
[0.0011],
...,
[0.0018],
[0.0007],
[0.0018]]), dtype=torch.float32)
In [5]:
results = compute_perplexity_for_batch(qmodel, tokenizer, example_texts)
print(f"Perplexity scores for each text: {results['perplexities']}")
Perplexity scores for each text: [45.38690948486328, 16.23394012451172]
Quanto integration in Transformers¶
In [6]:
from transformers import AutoModelForCausalLM, QuantoConfig
quantization_config = QuantoConfig(weights="int8", activations=None)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
print(model)
print(model.model.layers[0].self_attn.q_proj.weight)
# A Quanto-quantized model cannot be serialized through the transformers integration, so it cannot be saved here
# model.save_pretrained("output/transformers/QLlama-3.2-1B")
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
<class 'optimum.quanto.tensor.weights.qbytes.WeightQBytesTensor'>(tensor([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], dtype=torch.int8), scale=tensor([[0.0004],
[0.0008],
[0.0011],
...,
[0.0018],
[0.0007],
[0.0018]]), dtype=torch.float32)
Activation quantization / Calibration in Quanto¶
In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8, Calibration, quantization_map
from safetensors.torch import save_file
from datasets import load_dataset
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))

quantize(model, weights=qint8, activations=qint8)
with torch.no_grad(), Calibration(momentum=0.9):
    model.eval()
    for batch in calibration_dataset.iter(batch_size=2):
        inputs = tokenizer(batch["text"], return_tensors="pt", padding=True)
        input_ids = inputs.input_ids.to(model.device)
        attention_mask = inputs.attention_mask.to(model.device)
        output = model(input_ids, attention_mask=attention_mask)
        # good habit: free per-batch tensors and cached GPU memory
        del input_ids, attention_mask
        torch.cuda.empty_cache()

print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): QLinear(in_features=2048, out_features=128256, bias=False)
)
In [2]:
import os
import json

os.makedirs("output/calibration/QLlama-3.2-1B", exist_ok=True)
# Freeze integer weights
freeze(model)
# Serialize quantized model
save_file(model.state_dict(), 'output/calibration/QLlama-3.2-1B/model.safetensors')
# Store the quantized model's quantization map
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'w') as f:
    json.dump(quantization_map(model), f)
In [3]:
from safetensors.torch import load_file
from optimum.quanto import requantize
from transformers import AutoModelForCausalLM, AutoConfig
state_dict = load_file('output/calibration/QLlama-3.2-1B/model.safetensors')
with open('output/calibration/QLlama-3.2-1B/quantization_map.json', 'r') as f:
    quantization_map = json.load(f)
# Create an empty model from your modeling code and requantize it
config = AutoConfig.from_pretrained("/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/config.json")
model = AutoModelForCausalLM.from_config(config)
requantize(model, state_dict, quantization_map, device=torch.device('cuda'))
In [4]:
print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): QLinear(in_features=2048, out_features=2048, bias=False)
(k_proj): QLinear(in_features=2048, out_features=512, bias=False)
(v_proj): QLinear(in_features=2048, out_features=512, bias=False)
(o_proj): QLinear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(up_proj): QLinear(in_features=2048, out_features=8192, bias=False)
(down_proj): QLinear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): QLinear(in_features=2048, out_features=128256, bias=False)
)
Outlier problem¶
- Quanto simply uses `absmax()` to calculate the scale
- An outlier would compress most other values to 0 (see the example below)
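A toy numeric example of the issue, assuming absmax scaling to int8:

import torch

x = torch.tensor([0.1, -0.2, 0.3, 0.05, 50.0])     # one large outlier
scale = x.abs().max() / 127                          # absmax scale is dominated by the outlier
x_q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
print(x_q)                  # tensor([  0,  -1,   1,   0, 127], dtype=torch.int8)
print(x_q.float() * scale)  # 0.1 and 0.05 are both reconstructed as 0.0: their information is lost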
LLM.int8() in Bitsandbytes¶
- Saves outliers in a separate higher-precision tensor to preserve their information (see the sketch after the figure below)
- The model is quantized on the fly, without first loading it in full precision
Image source: Dettmers+2022
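A simplified sketch of this mixed-precision decomposition in plain PyTorch (the function name, threshold value, and the simulated int8 matmul are illustrative assumptions, not the bitsandbytes implementation): outlier feature columns stay in floating point, everything else goes through vector-wise absmax int8 quantization.

import torch

def llm_int8_matmul_sketch(X, W, threshold=6.0):
    # X: (tokens, in_features) activations, W: (out_features, in_features) weights
    # Feature dimensions whose activation magnitude exceeds the threshold are treated as outliers
    outlier = X.abs().amax(dim=0) > threshold
    # 1) Outlier features: keep the matmul in the original floating-point precision
    y_out = X[:, outlier] @ W[:, outlier].T
    # 2) Regular features: vector-wise absmax int8 quantization, int8 matmul (simulated in fp here)
    Xr, Wr = X[:, ~outlier], W[:, ~outlier]
    sx = Xr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127   # per-token scale
    sw = Wr.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127   # per-output-channel scale
    Xq = torch.round(Xr / sx).clamp(-127, 127)
    Wq = torch.round(Wr / sw).clamp(-127, 127)
    y_reg = (Xq @ Wq.T) * (sx * sw.T)                               # dequantize after the matmul
    return y_out + y_reg

X, W = torch.randn(4, 64), torch.randn(16, 64)
print((llm_int8_matmul_sketch(X, W) - X @ W.T).abs().max())         # stays close to the fp result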
In [1]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype="auto"
)
print(model)
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
(v_proj): Linear8bitLt(in_features=2048, out_features=512, bias=False)
(o_proj): Linear8bitLt(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear8bitLt(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear8bitLt(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
In [2]:
for name, param in model.named_parameters():
    if hasattr(param, "SCB"):
        print(name)
        print(param)
        print(param.SCB)
        break

print(model.get_memory_footprint() / 1e9)
model.layers.0.self_attn.q_proj.weight
Parameter containing:
Parameter(Int8Params([[-44, 16, 61, ..., -22, -29, 50],
[ 14, 70, 65, ..., -39, -18, 13],
[ 16, 14, 30, ..., -34, -34, -24],
...,
[ 17, 20, 40, ..., -40, -15, -16],
[ 32, -35, 50, ..., -17, -41, -21],
[-14, -30, -7, ..., 30, 5, -2]], device='cuda:0',
dtype=torch.int8))
tensor([0.0515, 0.1079, 0.1436, ..., 0.2256, 0.0898, 0.2305], device='cuda:0')
1.4985504
GPTQ (Generative Pre-trained Transformer Quantizer) Quantization¶
- Processes weights sequentially, one column at a time
- Compensates for the error induced by the current step by updating the not-yet-quantized weights (a simplified sketch follows below)
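A heavily simplified sketch of the per-column update in plain PyTorch (no grouping, blocking, or activation ordering; `gptq_sketch` is an illustrative name, not the gptqmodel API): quantize one weight column at a time and spread its quantization error over the remaining columns, using the Cholesky factor of the inverse Hessian built from calibration inputs.

import torch

def rtn_quantize_column(col, scale):
    # Plain round-to-nearest symmetric int4 quantization of one weight column
    return torch.clamp(torch.round(col / scale), -8, 7) * scale

def gptq_sketch(W, X, damp=0.01):
    # W: (out_features, in_features) weights, X: (n_samples, in_features) calibration inputs
    W = W.clone().float()
    Q = torch.zeros_like(W)
    scale = W.abs().amax(dim=1, keepdim=True) / 7                 # per-row symmetric scale
    H = X.T @ X + damp * torch.eye(X.shape[1])                    # damped proxy Hessian
    U = torch.linalg.cholesky(torch.linalg.inv(H)).T              # upper Cholesky factor of H^-1
    for i in range(W.shape[1]):                                   # process weight columns sequentially
        q = rtn_quantize_column(W[:, i], scale[:, 0])
        Q[:, i] = q
        err = (W[:, i] - q) / U[i, i]
        # compensate: push this column's quantization error onto the not-yet-quantized columns
        W[:, i:] -= err.unsqueeze(1) * U[i, i:].unsqueeze(0)
    return Q

W, X = torch.randn(8, 16), torch.randn(128, 16)
Q = gptq_sketch(W, X)  # 4-bit weights whose error is compensated w.r.t. the calibration inputs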
In [1]:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoModelForCausalLM
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4)
model = GPTQModel.load(model_name, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2);
model.save("output/official/QLlama-3.2-1B")
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]
INFO Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=128004 (token='<|finetune_right_pad_id|>').
INFO Model: Loaded `generation_config`: GenerationConfig { "bos_token_id": 128000, "do_sample": true, "eos_token_id": [ 128001, 128008, 128009 ], "temperature": 0.6, "top_p": 0.9 }
INFO Kernel: loaded -> `[]`
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO Process: progress logs for `gptq` will be streamed to file: `gptq_log_preexperiment_time_11_19_2025_18h_56m_52s.log`

| process | layer | module | loss | samples | damp | time | fwd_time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gptq | 0 | self_attn.k_proj | 0.30117315 | 1024 | 0.01000 | 1.217 | 3.466 |
| gptq | 0 | self_attn.v_proj | 0.00808881 | 1024 | 0.01000 | 0.450 | 3.466 |
| gptq | 0 | self_attn.q_proj | 0.61798751 | 1024 | 0.01000 | 0.454 | 3.466 |
| gptq | 0 | self_attn.o_proj | 0.00077137 | 1024 | 0.01000 | 0.456 | 2.546 |
| gptq | 0 | mlp.up_proj | 0.51245892 | 1024 | 0.01000 | 0.463 | 2.947 |
| gptq | 0 | mlp.gate_proj | 0.64624822 | 1024 | 0.01000 | 0.460 | 2.947 |
| gptq | 0 | mlp.down_proj | 0.00407354 | 1024 | 0.01000 | 2.052 | 7.199 |

(equivalent per-module loss tables are logged for layers 1-15; the per-module losses grow slowly with depth, e.g. mlp.gate_proj rises from 0.646 at layer 0 to 5.708 at layer 15)

INFO Packing model...
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO Packing Kernel: Auto-selection: adding candidate `TorchQuantLinear`
INFO Kernel: candidates -> `[TritonV2QuantLinear, TorchQuantLinear]`
INFO Kernel: selected -> `TritonV2QuantLinear`.
INFO Model packed.
0% INFO Format: Converting GPTQ v2 to v1 INFO Saved Quantize Config: { "bits": 4, "group_size": 128, "desc_act": true, "sym": true, "lm_head": false, "quant_method": "gptq", "checkpoint_format": "gptq", "pack_dtype": "int32", "meta": { "quantizer": [ "gptqmodel:2.2.0" ], "uri": "https://github.com/modelcloud/gptqmodel", "damp_percent": 0.01, "damp_auto_increment": 0.0025, "static_groups": false, "true_sequential": true, "mse": 0.0 } } Files in directory: quant_log.csv special_tokens_map.json generation_config.json quanto_qmap.json quantize_config.json tokenizer.json tokenizer_config.json README.md model.safetensors chat_template.jinja config.json Content of saved `generation_config.json`: { "bos_token_id": 128000, "do_sample": true, "eos_token_id": [ 128001, 128008, 128009 ], "temperature": 0.6, "top_p": 0.9, "transformers_version": "4.52.4" } Content of saved `config.json`: { "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": [ 128001, 128008, 128009 ], "head_dim": 64, "hidden_act": "silu", "hidden_size": 2048, "initializer_range": 0.02, "intermediate_size": 8192, "max_position_embeddings": 131072, "mlp_bias": false, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 16, "num_key_value_heads": 8, "pretraining_tp": 1, "quantization_config": { "bits": 4, "checkpoint_format": "gptq", "desc_act": true, "group_size": 128, "lm_head": false, "meta": { "damp_auto_increment": 0.0025, "damp_percent": 0.01, "mse": 0.0, "quantizer": [ "gptqmodel:2.2.0" ], "static_groups": false, "true_sequential": true, "uri": "https://github.com/modelcloud/gptqmodel" }, "pack_dtype": "int32", "quant_method": "gptq", "sym": true }, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 32.0, "high_freq_factor": 4.0, "low_freq_factor": 1.0, "original_max_position_embeddings": 8192, "rope_type": "llama3" }, "rope_theta": 500000.0, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "transformers_version": "4.52.4", "use_cache": true, "vocab_size": 128256 } INFO Pre-Quantized model size: 2357.14MB, 2.30GB INFO Quantized model size: 984.55MB, 0.96GB INFO Size difference: 1372.59MB, 1.34GB - 58.23%
GPTQ integration in Transformers¶
In [1]:
Copied!
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
tokenizer=AutoTokenizer.from_pretrained(model_name)
gptq_config=GPTQConfig(
bits=4,
dataset="c4", # optimum will download 'en/c4-train.00000-of-01024.json.gz'
tokenizer=tokenizer,
)
quantized_model=AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
quantization_config=gptq_config
)
print(quantized_model)
print(quantized_model.get_memory_footprint())
quantized_model.save_pretrained("output/transformers/QLlama-3.2-1B")
tokenizer.save_pretrained("output/transformers/QLlama-3.2-1B")
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
tokenizer=AutoTokenizer.from_pretrained(model_name)
gptq_config=GPTQConfig(
bits=4,
dataset="c4", # optimum will download 'en/c4-train.00000-of-01024.json.gz'
tokenizer=tokenizer,
)
quantized_model=AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
quantization_config=gptq_config
)
print(quantized_model)
print(quantized_model.get_memory_footprint())
quantized_model.save_pretrained("output/transformers/QLlama-3.2-1B")
tokenizer.save_pretrained("output/transformers/QLlama-3.2-1B")
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving. INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO:accelerate.utils.modeling:We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Quantizing model.layers blocks : 0%| | 0/16 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Start quantizing block model.layers 1/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 1/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 1/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 2/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 2/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 2/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 3/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 3/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 3/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 4/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 4/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 4/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 5/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 5/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 5/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 6/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 6/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 6/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 7/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 7/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 7/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 8/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 8/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 8/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 9/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 9/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 9/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 10/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 10/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 10/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 11/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 11/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 11/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 12/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 12/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 12/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 13/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 13/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 13/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 14/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 14/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 14/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 15/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 15/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 15/16... INFO:optimum.gptq.quantizer:Start quantizing block model.layers 16/16 INFO:optimum.gptq.quantizer:Module to quantize [['self_attn.q_proj'], ['self_attn.k_proj'], ['self_attn.v_proj'], ['self_attn.o_proj'], ['mlp.gate_proj'], ['mlp.up_proj'], ['mlp.down_proj']]
Quantizing layers inside the block: 0%| | 0/7 [00:00<?, ?it/s]
INFO:optimum.gptq.quantizer:Quantizing self_attn.q_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.k_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.v_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing self_attn.o_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.gate_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.up_proj in block 16/16... INFO:optimum.gptq.quantizer:Quantizing mlp.down_proj in block 16/16... INFO:optimum.gptq.quantizer:Packing model...
INFO Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`. INFO:optimum.gptq.quantizer:model.layers.0.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.0.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.0.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.1.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.1.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.2.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.2.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.3.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.3.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.4.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.4.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.5.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.5.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.6.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.6.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.7.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.7.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.o_proj 
INFO:optimum.gptq.quantizer:model.layers.8.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.8.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.8.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.9.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.9.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.10.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.10.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.11.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.11.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.12.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.12.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.13.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.13.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.14.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.14.mlp.up_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.k_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.o_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.q_proj INFO:optimum.gptq.quantizer:model.layers.15.self_attn.v_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.down_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.gate_proj INFO:optimum.gptq.quantizer:model.layers.15.mlp.up_proj INFO:optimum.gptq.quantizer:Model packed.
INFO Optimize: `TritonV2QuantLinear` compilation triggered.
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(k_proj): TritonV2QuantLinear()
(o_proj): TritonV2QuantLinear()
(q_proj): TritonV2QuantLinear()
(v_proj): TritonV2QuantLinear()
)
(mlp): LlamaMLP(
(act_fn): SiLU()
(down_proj): TritonV2QuantLinear()
(gate_proj): TritonV2QuantLinear()
(up_proj): TritonV2QuantLinear()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
1032327296
Out[1]:
('output/transformers/QLlama-3.2-1B/tokenizer_config.json',
'output/transformers/QLlama-3.2-1B/special_tokens_map.json',
'output/transformers/QLlama-3.2-1B/chat_template.jinja',
'output/transformers/QLlama-3.2-1B/tokenizer.json')
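The reported memory footprint (1,032,327,296 bytes, roughly 0.96 GB) is consistent with the linear weights being packed to 4 bits. Once saved, the checkpoint can be reloaded without repeating calibration, since the quantization settings are stored in `config.json`. The snippet below is a minimal sketch (reusing the output path from the cell above; the prompt and generation settings are only illustrative) of reloading and prompting the quantized model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_dir = "output/transformers/QLlama-3.2-1B"

# The quantization_config saved in config.json tells Transformers how to
# rebuild the GPTQ layers, so no GPTQConfig is needed at load time.
reloaded = AutoModelForCausalLM.from_pretrained(quant_dir, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quant_dir)

inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(reloaded.device)
output_ids = reloaded.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```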
AWQ: Activation-aware Weight Quantization¶
- Hypothesizes that activation magnitudes determine the importance of each weight
- Uses calibration data to identify salient channels
- Calculates per-channel scaling factors to reduce quantization error (see the sketch below)
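The toy sketch below (an illustration of the idea, not the llmcompressor implementation) shows why per-channel scaling helps: scaling a salient input channel up before 4-bit rounding, and folding the inverse scale into the activations, leaves the full-precision product unchanged while letting that channel use the quantization grid more finely. The channel index, scale value, and sizes are made up for the example; on this setup the scaled error is typically lower than plain round-to-nearest.

```python
import torch

def quantize_4bit(w):
    # symmetric per-output-channel 4-bit round-to-nearest
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return (w / scale).round().clamp(-8, 7) * scale

torch.manual_seed(0)
w = torch.randn(64, 64) * 0.02   # toy weight matrix (out_features x in_features)
x = torch.randn(16, 64)          # toy activations
x[:, 0] *= 50.0                  # input channel 0 carries unusually large activations ("salient")

s = torch.ones(64)
s[0] = 4.0                       # in real AWQ the scale is chosen from activation statistics

y_ref = x @ w.T
err_rtn = (x @ quantize_4bit(w).T - y_ref).pow(2).mean()
# W * s scales the salient input column up; x / s folds the inverse into the
# activations, so the full-precision product is unchanged.
err_awq = ((x / s) @ quantize_4bit(w * s).T - y_ref).pow(2).mean()
print(f"round-to-nearest error: {err_rtn:.6f} | AWQ-style scaled error: {err_awq:.6f}")
```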
AWQ in llmcompressor¶
- autoawq has been archived; llmcompressor, developed by the vLLM team, has taken over its functionality
- llmcompressor supports several quantization methods (a sketch of an alternative recipe follows this list):
- Simple PTQ
- GPTQ
- AWQ
- SmoothQuant
- SparseGPT
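The other methods in the list are exposed through the same recipe interface. The snippet below is a hedged sketch of an alternative W8A8 recipe that pairs SmoothQuant with GPTQ instead of AWQ; the modifier names and arguments follow the llmcompressor examples, but check them against the installed version.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes weights and activations to 8 bit; lm_head stays in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Pass this recipe to oneshot(model=..., dataset=..., recipe=recipe, ...)
# exactly as in the AWQ cell below.
```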
In [1]:
Copied!
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))
# Configure the quantization algorithm to run.
recipe = [
AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
# Apply algorithms.
oneshot(
model=model,
dataset=calibration_dataset,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
output_dir="output/llmcompressor/QLlama-3.2-1B"
)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
model_name = "/mimer/NOBACKUP/Datasets/LLM/huggingface/hub/models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e36fa60db5ac69ce42498103dcbcfa836229/"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))
# Configure the quantization algorithm to run.
recipe = [
AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]
# Apply algorithms.
oneshot(
model=model,
dataset=calibration_dataset,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
output_dir="output/llmcompressor/QLlama-3.2-1B"
)
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] /opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import]
Tokenizing: 0%| | 0/1024 [00:00<?, ? examples/s]
2025-11-20T04:34:49.610464+0100 | reset | INFO - Compression lifecycle reset 2025-11-20T04:34:49.613895+0100 | from_modifiers | INFO - Creating recipe from modifiers 2025-11-20T04:34:49.693089+0100 | on_initialize | INFO - No AWQModifier.mappings provided, inferring from model...
Resolving mapping 1/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1068.75it/s] Resolving mapping 2/4 (15 skipped): 100%|██████████| 16/16 [00:00<00:00, 2230.49it/s] Resolving mapping 3/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 1327.02it/s] Resolving mapping 4/4 (0 skipped): 100%|██████████| 16/16 [00:00<00:00, 2238.45it/s]
2025-11-20T04:34:49.757582+0100 | initialize | INFO - Compression lifecycle initialized for 1 modifiers 2025-11-20T04:34:49.758165+0100 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AWQModifier`
Preparing cache: 100%|██████████| 256/256 [00:00<00:00, 1555.82it/s] (1/17): Calibrating: 100%|██████████| 256/256 [00:02<00:00, 119.49it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (1/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 351.90it/s] (2/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 136.26it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.86s/it] (2/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 466.73it/s] (3/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 168.07it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.85s/it] (3/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 484.16it/s] (4/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 215.13it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.86s/it] (4/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 407.05it/s] (5/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 196.82it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (5/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 432.69it/s] (6/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 169.70it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (6/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 479.21it/s] (7/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 214.30it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (7/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 438.11it/s] (8/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 170.15it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (8/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 445.58it/s] (9/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 206.62it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.87s/it] (9/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 425.35it/s] (10/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 180.31it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (10/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 477.98it/s] (11/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 181.03it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (11/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 424.24it/s] (12/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 181.42it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.89s/it] (12/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 437.31it/s] (13/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 199.29it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (13/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 434.82it/s] (14/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 202.33it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.89s/it] (14/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 463.45it/s] (15/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 193.22it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.90s/it] (15/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 451.70it/s] (16/17): Calibrating: 100%|██████████| 256/256 [00:01<00:00, 176.80it/s] Smoothing: 100%|██████████| 3/3 [00:08<00:00, 2.88s/it] (16/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 400.77it/s] (17/17): Calibrating: 100%|██████████| 256/256 [00:00<00:00, 283.77it/s] Smoothing: 0it [00:00, ?it/s] (17/17): Propagating: 100%|██████████| 256/256 [00:00<00:00, 377.33it/s] Smoothing: 0it [00:00, ?it/s] Calibrating weights: 100%|██████████| 327/327 [00:01<00:00, 
255.26it/s]
2025-11-20T04:37:45.957744+0100 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-11-20T04:37:46.445192+0100 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 215it [00:04, 47.70it/s]
Out[1]:
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(128256, 2048)
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
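The saved directory uses the compressed-tensors format, which is what vLLM expects. A minimal sketch of serving it with vLLM (assuming vLLM is installed in the environment; the prompt and sampling settings are illustrative) could look like this; Transformers can also reload the checkpoint directly with `AutoModelForCausalLM.from_pretrained`.

```python
from vllm import LLM, SamplingParams

# Load the compressed-tensors checkpoint produced by llmcompressor above.
llm = LLM(model="output/llmcompressor/QLlama-3.2-1B")
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=64)
for out in llm.generate(["Explain activation-aware weight quantization in one sentence."], params):
    print(out.outputs[0].text)
```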
Other quantization methods supported in Transformers¶
Summary¶
We have introduced
- Using optimum-quanto to linearly quantize llama3.2-1b in 8-bit
- Using bitsandbytes to linearly quantize llama3.2-1b in 8-bit and handle outliers
- Using GPTQModel to quantize llama3.2-1b with the GPTQ method
- Using llmcompressor to quantize llama3.2-1b with the AWQ method
- Saving quantized models and reloading them
Reference¶
- https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization/notebook
- https://www.kaggle.com/code/aisuko/quantization-methods
- https://www.kaggle.com/code/aisuko/quantization-with-gptq
- https://apxml.com/courses/practical-llm-quantization
- https://github.com/huggingface/optimum-quanto
- https://github.com/bitsandbytes-foundation/bitsandbytes
- https://github.com/ModelCloud/GPTQModel
- https://huggingface.co/docs/transformers/quantization