Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
A 7B-parameter model in fp32 takes 28 GB of GPU memory. In fp16/bf16 it's 14 GB. In int8 it's 7 GB. In int4 it's 3.5 GB. Quantization maps the continuous float weights to a smaller set of discrete values, dramatically reducing memory and increasing throughput — at the cost of some accuracy. Understanding the tradeoffs between quantization schemes (post-training quantization vs quantization-aware training, weight-only vs activation quantization, per-tensor vs per-channel calibration) means you can run a 70B model on a single GPU, reduce inference latency by 2–4×, and make conscious tradeoffs between serving cost and model quality.
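The memory figures above follow from multiplying the parameter count by the bytes stored per parameter. A minimal sketch of that arithmetic is shown here; the helper name `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative, and the estimate assumes 1 GB = 10^9 bytes and counts weights only (no activations, KV cache, or per-group scale metadata).

```python
# Bytes needed to store one parameter at each precision (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    """Weight-only footprint in GB, assuming 1 GB = 1e9 bytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for n_params, name in [(7e9, "7B"), (70e9, "70B")]:
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(n_params, dtype)
        print(f"{name:>3} {dtype:>9}: {gb:6.1f} GB")
```

By the same arithmetic, a 70B model needs about 35 GB for its int4 weights, which is why it can fit on a single 80 GB GPU, while the bf16 weights alone need about 140 GB.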
Symmetric int8 quantization maps each fp32 weight to the nearest representable int8 value using a scale factor chosen to span the tensor's dynamic range. The rounding error is at most half a quantization step, and the step is that range divided by roughly 255 discrete levels, which is why a single large outlier in the weight matrix destroys resolution for all other weights. The code below shows the symmetric per-tensor path and then demonstrates the outlier problem on a weight matrix; the same effect in activations is what motivated LLM.int8()'s mixed-precision decomposition.
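As a worked version of the reasoning above, the symmetric per-tensor mapping can be written out as follows. The symbols $s$, $q$, and $\hat{W}$ are notation introduced here for illustration, and the value of about 4 for the largest ordinary weight is an assumption taken from the demo's own output message rather than a measured number.

```latex
s = \frac{\max_{i,j} |W_{ij}|}{127}, \qquad
q_{ij} = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{W_{ij}}{s}\right),\, -128,\, 127\right), \qquad
\hat{W}_{ij} = s \, q_{ij}

|W_{ij} - \hat{W}_{ij}| \le \frac{s}{2}

\text{no outlier: } s \approx \frac{4}{127} \approx 0.031
\qquad
\text{one outlier of } 1000: \; s = \frac{1000}{127} \approx 7.87, \quad \frac{s}{2} \approx 3.94
```

With the outlier present, the worst-case rounding error is about as large as the ordinary weights themselves, so nearly all of them round to zero.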
import torch
import numpy as np
torch.manual_seed(42)
W = torch.randn(64, 64) # a weight matrix in fp32
def quantize_int8(W):
    """Symmetric per-tensor int8 quantization."""
    abs_max = W.abs().max()
    scale = abs_max / 127.0  # map max to 127
    W_int8 = (W / scale).round().clamp(-128, 127).to(torch.int8)
    return W_int8, scale

def dequantize(W_int8, scale):
    return W_int8.to(torch.float32) * scale

W_int8, scale = quantize_int8(W)
W_reconstructed = dequantize(W_int8, scale)
error = (W - W_reconstructed).abs()
print(f"Weight dtype: fp32={W.element_size()}B → int8={W_int8.element_size()}B")
print(f"Memory: {W.numel()*4/1024:.1f} KB → {W_int8.numel()*1/1024:.1f} KB (4× reduction)")
print(f"Max quantization error: {error.max().item():.6f}")
print(f"Mean absolute error: {error.mean().item():.6f}")
print(f"Scale factor: {scale:.6f}")
# Outlier problem: large outliers force a large scale, wasting resolution on small weights
W_with_outlier = W.clone()
W_with_outlier[0, 0] = 1000.0 # one outlier
_, scale_bad = quantize_int8(W_with_outlier)
print(f"\nScale with outlier: {scale_bad:.3f} (vs {scale:.3f} without)")
print("All weights compressed into ±127 steps spanning ±1000 instead of ±4: precision destroyed")

python3 main.py

scale for each row of W. How does the mean absolute error change? This is why per-channel quantization is standard in GPTQ.
model = AutoModelForCausalLM.from_pretrained('gpt2', load_in_8bit=True). Compare model.get_memory_footprint() to fp32 and fp16 versions. Verify the 2× reduction.

Use these three in order. Each builds on the one before.
In one paragraph, explain what weight quantization does to a model. Why does int4 use 4× less memory than fp16, and what accuracy is lost? How do you measure that accuracy loss?
Walk me through why activation outliers (a small number of very large activation values) destroy the accuracy of naive int8 quantization for LLMs. What is the LLM.int8() solution (mixed-precision decomposition), and why doesn't this problem affect traditional CNNs as severely?
I need to serve a Llama-3 70B model at ≥30 tokens/second on 2× A100 80GB GPUs. Walk me through my options: (1) bf16 tensor-parallel, (2) int8 with bitsandbytes, (3) GPTQ int4, (4) AWQ int4. For each: expected VRAM usage, expected token throughput on this hardware, perplexity penalty on a standard benchmark, and implementation complexity with vLLM.
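Returning to the outlier demo earlier in the lesson, one common mitigation is to compute a separate scale per row instead of one scale for the whole matrix. The sketch below is a minimal illustration, not a reference implementation of GPTQ or any library: the function names `quantize_per_tensor`, `quantize_per_row`, and `mean_abs_error` are hypothetical, and treating each row as a "channel" is an assumption about the weight layout.

```python
import torch

def quantize_per_tensor(W):
    """Symmetric int8 quantization with a single scale for the whole matrix."""
    scale = W.abs().max() / 127.0
    W_int8 = (W / scale).round().clamp(-128, 127).to(torch.int8)
    return W_int8, scale

def quantize_per_row(W):
    """Symmetric int8 quantization with one scale per row (assumed to be the channel axis)."""
    # keepdim=True gives shape (rows, 1), so the scales broadcast across columns.
    abs_max = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # guard against all-zero rows
    scale = abs_max / 127.0
    W_int8 = (W / scale).round().clamp(-128, 127).to(torch.int8)
    return W_int8, scale

def mean_abs_error(W, W_int8, scale):
    """Mean absolute difference between W and its dequantized reconstruction."""
    return (W - W_int8.to(torch.float32) * scale).abs().mean().item()

torch.manual_seed(42)
W = torch.randn(64, 64)
W[0, 0] = 1000.0  # the same single outlier as in the demo

q_tensor, s_tensor = quantize_per_tensor(W)
q_row, s_row = quantize_per_row(W)

err_tensor = mean_abs_error(W, q_tensor, s_tensor)
err_row = mean_abs_error(W, q_row, s_row)

print(f"Per-tensor mean absolute error: {err_tensor:.6f}")
print(f"Per-row mean absolute error:    {err_row:.6f}")
```

With per-row scales the outlier only degrades the row it sits in; the other 63 rows keep scales matched to their own range. The cost is storing one extra scale per row.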