Capstok — learn by doing

Why this matters

When you first learn LLM inference, you run a single vLLM process serving one model, and it feels like the job is done. But a company doesn't have one model: it has a chat model, an embedding model, a reranker, a moderation classifier, a speech model, three fine-tunes, and last quarter's model still pinned for one customer. Each can come with a different framework, a different memory footprint, and a different SLA. Standing up a bespoke process per model, each with its own port, health check, batching configuration, and deploy pipeline, quickly becomes unmanageable sprawl. The enterprise serving problem is everything that surrounds the model: standardized deployment, GPU sharing, versioning, observability, and a single inference contract across heterogeneous models. That is the problem Triton, TensorRT-LLM, and NIM are built to address.

Demo

The demo lists what a real organization actually has to serve, so you feel the gap between 'run one model' and 'operate a fleet'. The point is the count and the heterogeneity, not any single model.

Try it yourself

Add two more models (a speech model, a vision classifier) to the inventory and recount the distinct frameworks — note how a per-process approach scales linearly in operational surface.
Sum the GPU demand and compare it to a single 80GB GPU; identify which small models could share one GPU instead of each taking a dedicated one.
Mark which two entries are versions of the same model and reason about why you need a versioning abstraction, not two separate deployments.
Write down, for each model, the one operational concern (health, batching, scaling) you would have to re-implement per process — that list is what an inference server gives you for free.
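
For the GPU-sharing exercise, one possible way to check your answer is a first-fit-decreasing packing of the demo inventory. This is only a sketch: the `pack` helper and the 72 GB usable budget (an 80 GB GPU minus roughly 10% headroom) are assumptions for illustration, and the sketch considers memory only. It ignores SLA contention between co-located models, which a real placement decision would also have to weigh.

```python
# Sketch: first-fit-decreasing packing of the demo inventory onto 80 GB GPUs.
inventory = [
    {"name": "chat-llm",     "gpu_gb": 40},
    {"name": "embeddings",   "gpu_gb": 2},
    {"name": "reranker",     "gpu_gb": 4},
    {"name": "safety-guard", "gpu_gb": 1},
    {"name": "chat-llm@v3",  "gpu_gb": 40},
]

USABLE_GB = 72  # assumption: keep roughly 10% of an 80 GB GPU free as headroom


def pack(models, usable_gb=USABLE_GB):
    """Place each model on the first GPU with room, largest models first."""
    gpus = []  # each entry: {"used": GB already placed, "models": [names]}
    for m in sorted(models, key=lambda m: m["gpu_gb"], reverse=True):
        for gpu in gpus:
            if gpu["used"] + m["gpu_gb"] <= usable_gb:
                gpu["used"] += m["gpu_gb"]
                gpu["models"].append(m["name"])
                break
        else:
            # No existing GPU had room, so open a new one.
            gpus.append({"used": m["gpu_gb"], "models": [m["name"]]})
    return gpus


placement = pack(inventory)
for i, gpu in enumerate(placement):
    print(f"GPU {i}: {gpu['used']} GB used -> {gpu['models']}")
print(f"{len(placement)} GPUs instead of {len(inventory)} dedicated ones")
```

Under these assumptions the three small models fit alongside one copy of the chat model, and the pinned version takes the second GPU. Sharing a GPU this way only works if the serving layer can host several models in one process or otherwise arbitrate GPU access between them.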

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain why a company serving many AI models needs a dedicated inference server instead of just running a separate process per model.

2. Why it works (the mechanism)

Walk me through what operational concerns (deployment, health, batching, versioning, GPU sharing, metrics) get duplicated when each model runs as its own bespoke process, and how a unified inference server consolidates them.

3. Advanced — application & what's next

Given an inventory of 8 models across 4 frameworks with mixed SLAs on a fixed GPU budget, how would I decide which to co-locate, and what does that decision require from the serving layer?

# What a single team is actually asked to serve in production:
inventory = [
    {"name": "chat-llm",      "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200},
    {"name": "embeddings",    "framework": "ONNX",         "gpu_gb": 2,  "sla_ms": 30},
    {"name": "reranker",      "framework": "PyTorch",      "gpu_gb": 4,  "sla_ms": 50},
    {"name": "safety-guard",  "framework": "ONNX",         "gpu_gb": 1,  "sla_ms": 20},
    {"name": "chat-llm@v3",   "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200},  # pinned for one customer
]
# Naive approach: one bespoke server process per row -> 5 ports, 5 health checks,
# 5 batching configs, 5 deploy pipelines, 5 ways to be paged at 3am.
frameworks = {m["framework"] for m in inventory}
print(f"{len(inventory)} models across {len(frameworks)} frameworks: {sorted(frameworks)}")
print(f"total GPU memory demand across all models: {sum(m['gpu_gb'] for m in inventory)} GB")
# An inference SERVER hosts all of these behind ONE contract, sharing GPUs.

Run: python3 main.py
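
To make "one contract" and the versioning abstraction a little more concrete, here is a toy sketch. Everything in it is hypothetical: `ModelRegistry`, its methods, the `v4` version label, and the lambda handlers are illustrative stand-ins, not the API of Triton, NIM, or any real server. It only shows the shape of the idea: every model is reached through the same `infer` call, and a version is a routing parameter rather than a separate deployment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical handler type: each model, whatever its framework, is wrapped
# so that it takes a dict of inputs and returns a dict of outputs.
Handler = Callable[[dict], dict]


@dataclass
class ModelRegistry:
    """Toy stand-in for an inference server's model repository (not a real API)."""
    _models: Dict[Tuple[str, str], Handler] = field(default_factory=dict)
    _latest: Dict[str, str] = field(default_factory=dict)

    def register(self, name: str, version: str, handler: Handler) -> None:
        self._models[(name, version)] = handler
        self._latest[name] = version  # most recently registered version is the default

    def infer(self, name: str, inputs: dict, version: Optional[str] = None) -> dict:
        # One entry point for every model; version defaults to the latest.
        chosen = version or self._latest[name]
        return self._models[(name, chosen)](inputs)

    def health(self) -> list:
        # One health listing for the whole fleet instead of one check per process.
        return sorted(f"{n}@{v}" for n, v in self._models)


registry = ModelRegistry()
registry.register("chat-llm", "v3", lambda x: {"model": "chat-llm", "version": "v3"})
registry.register("chat-llm", "v4", lambda x: {"model": "chat-llm", "version": "v4"})
registry.register("embeddings", "v1", lambda x: {"model": "embeddings", "version": "v1"})

default_reply = registry.infer("chat-llm", {"text": "hi"})                # routed to v4
pinned_reply = registry.infer("chat-llm", {"text": "hi"}, version="v3")   # pinned customer
print(default_reply, pinned_reply, registry.health())
```

The pinned customer and everyone else call the same endpoint; only the version argument differs, which is why the two `chat-llm` rows in the inventory do not need two separate deployment pipelines.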