Quantization — int8, int4, GPTQ, AWQ, bitsandbytes

medium

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

A 7B-parameter model in fp32 takes 28 GB of GPU memory. In fp16/bf16 it's 14 GB. In int8 it's 7 GB. In int4 it's 3.5 GB. Quantization maps the continuous float weights to a smaller set of discrete values, dramatically reducing memory and increasing throughput — at the cost of some accuracy. Understanding the tradeoffs between quantization schemes (post-training quantization vs quantization-aware training, weight-only vs activation quantization, per-tensor vs per-channel calibration) means you can run a 70B model on a single GPU, reduce inference latency by 2–4×, and make conscious tradeoffs between serving cost and model quality.

Demo

Quantization maps each fp32 weight to the nearest representable int8 value using a scale factor that spans the weight's dynamic range. The precision loss is proportional to that range divided by 255 discrete levels — which is why a single large outlier in the weight matrix destroys resolution for all other weights. The code below shows the symmetric per-tensor path and then demonstrates the outlier problem that motivated LLM.int8()'s mixed-precision decomposition.

import torch
import numpy as np

torch.manual_seed(42)
W = torch.randn(64, 64)  # a weight matrix in fp32

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization."""
    abs_max = W.abs().max()
    scale   = abs_max / 127.0           # map max to 127
    W_int8  = (W / scale).round().clamp(-128, 127).to(torch.int8)
    return W_int8, scale

def dequantize(W_int8, scale):
    return W_int8.to(torch.float32) * scale

W_int8, scale = quantize_int8(W)
W_reconstructed = dequantize(W_int8, scale)

error = (W - W_reconstructed).abs()
print(f"Weight dtype: fp32={W.element_size()}B → int8={W_int8.element_size()}B")
print(f"Memory: {W.numel()*4/1024:.1f} KB → {W_int8.numel()*1/1024:.1f} KB  (4× reduction)")
print(f"Max quantization error: {error.max().item():.6f}")
print(f"Mean absolute error:    {error.mean().item():.6f}")
print(f"Scale factor:           {scale:.6f}")

# Outlier problem: large outliers force a large scale, wasting resolution on small weights
W_with_outlier = W.clone()
W_with_outlier[0, 0] = 1000.0   # one outlier
_, scale_bad = quantize_int8(W_with_outlier)
print(f"\nScale with outlier: {scale_bad:.3f}  (vs {scale:.3f} without)")
print("All weights compressed into ±127 steps spanning ±1000 instead of ±4 — precision destroyed")

Run: python3 main.py

Try it yourself

Run the outlier example and compute the mean absolute error on the weights excluding the outlier. Compare to the no-outlier case. This demonstrates exactly why activation outliers are such a problem for naive int8 quantization and why LLM.int8() uses mixed-precision for outlier columns.

Change from per-tensor to per-row quantization: compute a separate scale for each row of W. How does the mean absolute error change? This is why per-channel quantization is standard in GPTQ.

Install bitsandbytes and load a model in 8-bit: model = AutoModelForCausalLM.from_pretrained('gpt2', load_in_8bit=True). Compare model.get_memory_footprint() to fp32 and fp16 versions. Verify the 2× reduction.

Research GPTQ vs AWQ: GPTQ uses second-order Hessian information to choose which weights to round up vs down. AWQ activates-guided quantization to protect salient weights. For a 7B model, which produces better perplexity at 4-bit, and how much does calibration data size affect the result?

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what weight quantization does to a model. Why does int4 use 4× less memory than fp16, and what accuracy is lost? How do you measure that accuracy loss?

2. Why it works (the mechanism)

Walk me through why activation outliers (a small number of very large activation values) destroy the accuracy of naive int8 quantization for LLMs. What is the LLM.int8() solution (mixed-precision decomposition), and why doesn't this problem affect traditional CNNs as severely?

3. Advanced — application & what's next

I need to serve a Llama-3 70B model at ≥30 tokens/second on 2× A100 80GB GPUs. Walk me through my options: (1) bf16 tensor-parallel, (2) int8 with bitsandbytes, (3) GPTQ int4, (4) AWQ int4. For each: expected VRAM usage, expected token throughput on this hardware, perplexity penalty on a standard benchmark, and implementation complexity with vLLM.