Getting a fine-tuned model to run locally is not the same as serving it to 100 concurrent users with sub-second latency. Production LLM serving is dominated by one bottleneck: KV-cache memory. Generating each new token attends over the key/value tensors of every previous token, so those tensors are cached, and the cache grows linearly with both sequence length and the number of concurrent requests. vLLM tackled this with PagedAttention, which borrows paging from OS virtual memory: the KV cache is stored in fixed-size blocks allocated on demand, which cuts fragmentation and lets requests that share a prefix share blocks. Understanding the serving stack (batching strategies, KV cache management, quantization, tensor parallelism) means you can choose the right serving framework and tune it for your hardware.
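The linear growth described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a profiler: the function names are illustrative, it assumes fp16 storage (2 bytes per element) and one key plus one value vector per layer per KV head, and the block size of 16 tokens is an assumed page size rather than a measured setting.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held for ONE token: a key and a value per layer per head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(seq_len, batch_size, **model):
    """Total KV cache for batch_size sequences of seq_len tokens each."""
    return seq_len * batch_size * kv_bytes_per_token(**model)


def paged_blocks(seq_len, block_size=16):
    """Fixed-size blocks needed for one sequence (block_size is an assumption)."""
    return -(-seq_len // block_size)  # ceiling division


# Hypothetical 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
model = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_bytes_per_token(**model)
total = kv_cache_bytes(seq_len=1024, batch_size=4, **model)

print(f"per token : {per_token / 2**20:.2f} MiB")
print(f"4 x 1024  : {total / 2**30:.2f} GiB")
print(f"blocks for a 1000-token sequence: {paged_blocks(1000)}")
```

Doubling either the context length or the number of concurrent sequences doubles the total, which is why the cache, not the weights, usually caps concurrency.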
Static batching serializes entire batches: all requests in the batch must finish before any new request enters, so finished slots sit idle while the GPU waits for the slowest sequence. Continuous batching evicts a finished sequence immediately and admits a waiting request in its place, keeping the batch full. The throughput difference is not marginal: on a workload with highly variable output lengths, continuous batching can deliver up to 10–30× more requests per second on the same hardware, which is why vLLM made it the default and why production serving stacks have largely moved away from static batching.
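The scheduling difference can be sketched as a step-level simulation. This is a toy model under stated assumptions: every decode step costs the same no matter how many slots are busy (a rough stand-in for memory-bound decoding), prefill is ignored, and the function names and the random output lengths are illustrative only.

```python
import random


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled before the next step."""
    pending = list(lengths)[::-1]  # reversed so pop() yields arrival order
    running = []                   # tokens still to generate per active slot
    steps = 0
    while pending or running:
        # Admit waiting requests into any free slots.
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        steps += 1
        # Each active sequence emits one token; drop those that just finished.
        running = [left - 1 for left in running if left > 1]
    return steps


random.seed(0)
lengths = [random.randint(10, 400) for _ in range(100)]  # variable output lengths

static = static_batching_steps(lengths, batch_size=8)
continuous = continuous_batching_steps(lengths, batch_size=8)
print(f"static     : {static} steps")
print(f"continuous : {continuous} steps")
```

With identical output lengths the two schedulers take the same number of steps; the gap opens only as the lengths within a batch diverge.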
# Throughput comparison: naive serving vs static vs continuous batching
import math


def estimate_throughput(strategy, num_requests=100, avg_output_tokens=200):
    """Rough throughput model (not a real benchmark; illustrative only)."""
    if strategy == "naive_sequential":
        # Process one request at a time.
        # Decode speed: ~50 tokens/sec on A100 for a 7B model (memory-bound)
        tokens_per_sec = 50
        time_per_req = avg_output_tokens / tokens_per_sec
        total_time = num_requests * time_per_req
    elif strategy == "static_batching":
        # Wait for batch_size requests, run them together, wait for the slowest.
        batch_size = 8
        # Aggregate decode rate across the batch at ~70% efficiency.
        tokens_per_sec = 50 * batch_size * 0.7
        # Every slot is held until the longest response finishes (assume 2x the
        # average), so each batch pays for batch_size * longest token slots.
        time_per_batch = (batch_size * avg_output_tokens * 2) / tokens_per_sec
        # A partially filled final batch still costs a full run.
        num_batches = math.ceil(num_requests / batch_size)
        total_time = num_batches * time_per_batch
    elif strategy == "continuous_batching":
        # vLLM: as soon as one request finishes, insert the next into the batch.
        # No compute wasted on padding. Assumes the paged KV cache leaves room
        # for 16 concurrent sequences.
        tokens_per_sec = 50 * 16 * 0.92  # ~92% GPU utilization
        total_tokens = num_requests * avg_output_tokens
        total_time = total_tokens / tokens_per_sec
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    throughput = num_requests / total_time
    return throughput, total_time


for strategy in ["naive_sequential", "static_batching", "continuous_batching"]:
    tps, t = estimate_throughput(strategy)
    print(f"{strategy:<25} {tps:5.2f} req/s total={t:6.1f}s")

Run it:
python3 main.py

Try it: pip install vllm; python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000. Then call it with curl using the OpenAI-compatible format. Measure throughput with wrk or locust.

Use these three in order. Each builds on the one before.
In one paragraph, explain what continuous batching is and why it improves LLM serving throughput over static batching. What is the key difference in how requests are processed?
Walk me through why KV cache is the dominant memory constraint in LLM serving. For a 7B model with 32 layers, 32 heads, head_dim=128, serving a 2048-token context at batch size 16, compute the KV cache size in GB. How does this limit the maximum concurrent requests on a single A100 80GB?
I need to serve a fine-tuned Llama-3 8B model to 1000 daily active users with p95 TTFT < 500ms and p95 TPOT < 50ms/token. I have 1× A100 80GB. Walk me through: vLLM configuration (tensor_parallel_size, max_num_seqs, quantization), whether to use AWQ int4, expected throughput, and what happens to latency as the number of concurrent users grows from 1 to 50.
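The targets in the last prompt are percentile targets, and averages can hide a violation. Below is a minimal sketch for checking them, assuming you have already collected per-request TTFT and TPOT samples in milliseconds from a load test. The helper names `percentile` and `meets_slo` are hypothetical, the sample data is made up, and nearest-rank is only one of several percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(ttft_ms, tpot_ms, ttft_target_ms=500, tpot_target_ms=50):
    """Compare p95 TTFT and p95 TPOT against the targets (defaults from the prompt)."""
    p95_ttft = percentile(ttft_ms, 95)
    p95_tpot = percentile(tpot_ms, 95)
    return {
        "p95_ttft_ms": p95_ttft,
        "p95_tpot_ms": p95_tpot,
        "ok": p95_ttft < ttft_target_ms and p95_tpot < tpot_target_ms,
    }


# Made-up samples: the mean TTFT is under 500 ms, yet two slow requests
# out of twenty are enough to push p95 over the target.
ttft = [100] * 18 + [900] * 2
tpot = [30] * 20
print(sum(ttft) / len(ttft))
print(meets_slo(ttft, tpot))
```

Re-running the check at each concurrency level from 1 to 50 shows where the tail, rather than the mean, first crosses a target.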