Serving fine-tuned models — vLLM, TGI, Ollama comparison

hard

Why this matters

Getting a fine-tuned model to run locally is not the same as serving it to 100 concurrent users with sub-second latency. Production LLM serving is dominated by one bottleneck: KV-cache memory. To generate each new token, the server must keep the attention key/value tensors of every previous token in GPU memory, so the cache grows linearly with both sequence length and the number of concurrent requests. vLLM addressed this with PagedAttention, which borrows the paging idea from operating-system virtual memory: the KV cache is allocated in small fixed-size blocks on demand, which cuts fragmentation and lets requests with a common prefix share blocks. Understanding the serving stack (batching strategies, KV cache management, quantization, tensor parallelism) lets you choose the right serving framework and tune it for your hardware.

Demo

Static batching processes one whole batch at a time: every request in the batch must finish before any new request enters, so GPU slots sit idle while the batch waits for its slowest sequence. Continuous batching schedules at the level of individual decode steps: a finished sequence leaves the batch immediately and a waiting request takes its slot, keeping the batch full. The throughput difference is not marginal. On workloads with highly variable output lengths, published benchmarks report gains ranging from several-fold to more than 20× over naive static batching on the same hardware, which is why vLLM and TGI both use continuous batching by default and why production serving stacks have largely abandoned static batching.

# Throughput comparison: sequential serving vs static vs continuous batching

import math


def estimate_throughput(strategy, num_requests=100, avg_output_tokens=200):
    """Rough throughput model (not a real benchmark; illustrative only)."""

    # Decode is memory-bound, so one decode step takes roughly the same time
    # whether it serves 1 sequence or 16. Assume ~50 steps/sec on an A100 for
    # a 7B model, i.e. ~50 tokens/sec per sequence.
    steps_per_sec = 50

    if strategy == "naive_sequential":
        # Process one request at a time
        time_per_req = avg_output_tokens / steps_per_sec
        total_time   = num_requests * time_per_req

    elif strategy == "static_batching":
        # Wait for batch_size requests, run them together, wait for the slowest
        batch_size     = 8
        # Each batch runs until its longest response finishes (assume 2x average);
        # slots that finish early sit idle, which is the padding waste.
        longest_tokens = avg_output_tokens * 2
        time_per_batch = longest_tokens / steps_per_sec
        num_batches    = math.ceil(num_requests / batch_size)
        total_time     = num_batches * time_per_batch

    elif strategy == "continuous_batching":
        # vLLM: as soon as one request finishes, the next one takes its slot
        # No padding waste; assume 16 slots kept ~92% busy
        tokens_per_sec = steps_per_sec * 16 * 0.92
        total_tokens   = num_requests * avg_output_tokens
        total_time     = total_tokens / tokens_per_sec

    else:
        raise ValueError(f"unknown strategy: {strategy}")

    throughput = num_requests / total_time
    return throughput, total_time

for strategy in ["naive_sequential", "static_batching", "continuous_batching"]:
    tps, t = estimate_throughput(strategy)
    print(f"{strategy:<25}  {tps:5.1f} req/s  total={t:5.1f}s")

Run: python3 main.py
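
The estimate above works from averages. To see the scheduling difference itself, a step-level simulation can help. The following is a minimal sketch, not a model of any real scheduler: the function name `simulate`, the slot count, and the random length distribution are all assumptions for illustration, and it ignores prefill cost and memory limits. It counts how many decode steps each strategy needs to finish the same queue of requests.

```python
import random


def simulate(strategy, lengths, slots=8):
    """Return the number of decode steps needed to finish every request.

    lengths: output length (in tokens) of each queued request, in FIFO order.
    slots:   how many sequences the batch can hold at once (assumed value).
    """
    pending = list(lengths)
    steps = 0

    if strategy == "static":
        # Fill a batch, run until its longest member finishes, then refill.
        while pending:
            batch, pending = pending[:slots], pending[slots:]
            steps += max(batch)

    elif strategy == "continuous":
        active = []
        while pending or active:
            # Refill free slots before every decode step.
            while pending and len(active) < slots:
                active.append(pending.pop(0))
            # One decode step produces one token for every active sequence.
            active = [remaining - 1 for remaining in active]
            # Finished sequences leave the batch immediately.
            active = [remaining for remaining in active if remaining > 0]
            steps += 1

    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return steps


random.seed(0)
lengths = [random.randint(20, 400) for _ in range(100)]
for name in ("static", "continuous"):
    print(f"{name:<12} {simulate(name, lengths)} decode steps")
```

The wider the spread of output lengths, the more steps static batching spends on slots that have already finished.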

Try it yourself

Run vLLM locally (if you have a GPU): pip install vllm; python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000. Then call it with curl using the OpenAI-compatible format. Measure throughput with wrk or locust.
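
If you would rather call the server from Python than from curl, a sketch along these lines may be a starting point. It uses only the standard library and targets the `/v1/completions` route of the OpenAI-compatible server. The helper names `build_completion_request` and `complete` are made up for this example, and the base URL assumes the server from the command above is listening on localhost port 8000.

```python
import json
import urllib.request


def build_completion_request(prompt, model,
                             base_url="http://localhost:8000", max_tokens=64):
    """Build (but do not send) an OpenAI-style completion request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Explain KV cache in one sentence.", "mistralai/Mistral-7B-Instruct-v0.3")` should return the generated text; the `model` value has to match the name the server was started with.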

Compare vLLM vs Ollama for the same model on your hardware. Measure: (1) time-to-first-token for a short prompt, (2) tokens/sec for a 200-token completion, (3) maximum concurrent requests before latency degrades. These three numbers tell you which to use for your workload.
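
For the first two measurements, one option is to stream the response, record the arrival time of every token, and reduce those timestamps afterwards. The helper below is a sketch of that reduction step only; `latency_metrics` and its field names are assumptions for this example, and collecting the timestamps from a streaming client is left to you.

```python
def latency_metrics(request_start, token_times):
    """Summarize one streamed completion.

    request_start: time the request was sent (seconds, e.g. time.perf_counter()).
    token_times:   arrival time of each generated token, same clock, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")

    n = len(token_times)
    ttft = token_times[0] - request_start            # time to first token
    decode_time = token_times[-1] - token_times[0]   # time spent after first token
    # Time per output token is averaged over the tokens after the first one.
    tpot = decode_time / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_start
    tokens_per_sec = n / total if total > 0 else float("inf")

    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_sec": tokens_per_sec}
```

Run it once per request and compare the distributions (p50, p95) between servers instead of single samples, since one slow request can dominate a mean.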

Research PagedAttention: vLLM divides the KV cache into fixed-size pages (blocks) that can be allocated and freed independently, like virtual memory pages. What is the block size vLLM uses by default? How does this prevent KV cache fragmentation that naive pre-allocation would cause?
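
To get a feel for the fragmentation question before reading the vLLM source, the toy comparison below may help. It is a sketch of the accounting only, not of vLLM's allocator: the function names are made up, and `block_size=16` is an illustrative value that you should check against the vLLM documentation. It compares how many token slots are reserved when every request gets a full `max_seq_len` region up front versus when slots are handed out one block at a time.

```python
import math


def preallocated_slots(seq_lens, max_seq_len):
    """Naive scheme: every request reserves room for max_seq_len tokens."""
    return len(seq_lens) * max_seq_len


def paged_slots(seq_lens, block_size=16):
    """Paged scheme: each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(reserved, seq_lens):
    """Share of reserved token slots that hold no token."""
    return 1 - sum(seq_lens) / reserved


seq_lens = [100, 17, 2048]   # current lengths of three in-flight requests
naive = preallocated_slots(seq_lens, max_seq_len=2048)
paged = paged_slots(seq_lens)
print(f"pre-allocated: {naive} slots, waste {waste_fraction(naive, seq_lens):.1%}")
print(f"paged:         {paged} slots, waste {waste_fraction(paged, seq_lens):.1%}")
```

With paging, the waste per request is bounded by one partially filled block, regardless of how long the request could eventually grow.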

Set up a TGI (Text Generation Inference) server for the same model. Compare its throughput to vLLM on your hardware. When does TGI outperform vLLM? (Hint: TGI has better support for some model architectures and quantization formats.)

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain what continuous batching is and why it improves LLM serving throughput over static batching. What is the key difference in how requests are processed?

2. Why it works (the mechanism)

Walk me through why KV cache is the dominant memory constraint in LLM serving. For a 7B model with 32 layers, 32 heads, head_dim=128, serving a 2048-token context at batch size 16, compute the KV cache size in GB. How does this limit the maximum concurrent requests on a single A100 80GB?
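
To check the AI's arithmetic, the calculation can be sketched as follows. Several inputs are assumptions that are not stated in the prompt: fp16 storage (2 bytes per value), standard multi-head attention with no grouped-query sharing, 14 GB of fp16 weights for the 7B model, 80 GiB of device memory, and 90% of that memory made available to the server.

```python
GIB = 2**30


def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per head per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_bytes(ctx_len, batch_size, **model):
    """Total KV cache for batch_size sequences of ctx_len tokens each."""
    return kv_bytes_per_token(**model) * ctx_len * batch_size


def max_concurrent_seqs(ctx_len=2048, gpu_bytes=80 * GIB, utilization=0.9,
                        weight_bytes=7e9 * 2):
    """Full-length sequences that fit after the weights are loaded (assumed budget)."""
    budget = gpu_bytes * utilization - weight_bytes
    return int(budget // kv_bytes(ctx_len, batch_size=1))


per_token = kv_bytes_per_token()
batch_total = kv_bytes(ctx_len=2048, batch_size=16)
print(f"per token:         {per_token / 2**20:.2f} MiB")
print(f"batch of 16 x 2048: {batch_total / GIB:.1f} GiB")
print(f"max full-length sequences: {max_concurrent_seqs()}")
```

Activation memory and framework overhead are ignored here, so the real limit is somewhat lower than this estimate.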

3. Advanced — application & what's next

I need to serve a fine-tuned Llama-3 8B model to 1000 daily active users with p95 TTFT < 500ms and p95 TPOT < 50ms/token. I have 1× A100 80GB. Walk me through: vLLM configuration (tensor_parallel_size, max_num_seqs, quantization), whether to use AWQ int4, expected throughput, and what happens to latency as the number of concurrent users grows from 1 to 50.
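
When you compare the AI's answer against a real deployment, it may help to have the launch command in one place so you can vary one setting at a time. The sketch below only assembles a `vllm serve` command line; the helper name, the model path `./llama3-8b-finetuned`, and every default value are placeholders chosen for illustration, not tuned recommendations.

```python
import shlex


def vllm_serve_command(model_path, max_num_seqs=64, max_model_len=4096,
                       gpu_memory_utilization=0.90, quantization=None, port=8000):
    """Assemble a vLLM launch command for a single GPU (values are placeholders)."""
    args = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", "1",          # one A100, so no tensor parallelism
        "--max-num-seqs", str(max_num_seqs),    # cap on concurrently scheduled sequences
        "--max-model-len", str(max_model_len),  # longest context the server accepts
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]
    if quantization:
        # e.g. "awq" for a checkpoint that was already quantized with AWQ
        args += ["--quantization", quantization]
    return shlex.join(args)


print(vllm_serve_command("./llama3-8b-finetuned"))
print(vllm_serve_command("./llama3-8b-finetuned", quantization="awq", max_num_seqs=128))
```

Raising `max_num_seqs` generally trades per-token latency for throughput, so measure p95 TTFT and TPOT after each change instead of relying on the estimate.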