Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
When you learn LLM inference you run a single vLLM process serving one model, and it feels like the job is done. But a company doesn't have one model: it has a chat model, an embedding model, a reranker, a moderation classifier, a speech model, three fine-tunes, and last quarter's model still pinned for one customer. Each comes with a different framework, a different memory footprint, and a different SLA. Standing up a bespoke process per model, each with its own port, health check, batching config, and deploy pipeline, collapses into unmanageable sprawl. The enterprise serving problem is everything that surrounds the model: standardized deployment, GPU sharing, versioning, observability, and a single inference contract across heterogeneous models. That is why Triton, TensorRT-LLM, and NIM exist.
The demo lists what a real organization actually has to serve, so you feel the gap between 'run one model' and 'operate a fleet'. The point is the count and the heterogeneity, not any single model.
Use these three in order. Each builds on the one before.
In one paragraph, explain why a company serving many AI models needs a dedicated inference server instead of just running a separate process per model.
Walk me through what operational concerns (deployment, health, batching, versioning, GPU sharing, metrics) get duplicated when each model runs as its own bespoke process, and how a unified inference server consolidates them.
Given an inventory of 8 models across 4 frameworks with mixed SLAs on a fixed GPU budget, how would I decide which to co-locate, and what does that decision require from the serving layer?
# What a single team is actually asked to serve in production:
inventory = [
{"name": "chat-llm", "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200},
{"name": "embeddings", "framework": "ONNX", "gpu_gb": 2, "sla_ms": 30},
{"name": "reranker", "framework": "PyTorch", "gpu_gb": 4, "sla_ms": 50},
{"name": "safety-guard", "framework": "ONNX", "gpu_gb": 1, "sla_ms": 20},
{"name": "chat-llm@v3", "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200}, # pinned for one customer
]
# Naive approach: one bespoke server process per row -> 5 ports, 5 health checks,
# 5 batching configs, 5 deploy pipelines, 5 ways to be paged at 3am.
frameworks = {m["framework"] for m in inventory}
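print(f"{len(inventory)} models across {len(frameworks)} frameworks: {sorted(frameworks)}")
# Summed memory footprint; the naive approach spreads it over one GPU per model.
print(f"total GPU memory demand: {sum(m['gpu_gb'] for m in inventory)} GB, "
      f"spread over {len(inventory)} GPUs if each model gets its own")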
# An inference SERVER hosts all of these behind ONE contract, sharing GPUs.

python3 main.py
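The demo's closing point about sharing GPUs can be made concrete with a rough sketch. The block below packs the same inventory onto shared GPUs using first-fit decreasing, a standard bin-packing heuristic. It is an illustration under stated assumptions, not how Triton or NIM actually schedule models: the 80 GB capacity (`GPU_CAPACITY_GB`) and the helper `pack_first_fit_decreasing` are hypothetical, and memory is the only constraint considered (SLAs, KV-cache headroom, and compute contention are ignored).

```python
# Hypothetical sketch: co-locating models on shared GPUs by memory footprint only.
# GPU_CAPACITY_GB is an assumed value, not something stated in the lesson.
GPU_CAPACITY_GB = 80

inventory = [
    {"name": "chat-llm", "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200},
    {"name": "embeddings", "framework": "ONNX", "gpu_gb": 2, "sla_ms": 30},
    {"name": "reranker", "framework": "PyTorch", "gpu_gb": 4, "sla_ms": 50},
    {"name": "safety-guard", "framework": "ONNX", "gpu_gb": 1, "sla_ms": 20},
    {"name": "chat-llm@v3", "framework": "TensorRT-LLM", "gpu_gb": 40, "sla_ms": 200},
]


def pack_first_fit_decreasing(models, capacity_gb):
    """Place each model, largest first, on the first GPU with enough free memory."""
    gpus = []  # each GPU is represented as the list of models placed on it
    for model in sorted(models, key=lambda m: m["gpu_gb"], reverse=True):
        if model["gpu_gb"] > capacity_gb:
            raise ValueError(f"{model['name']} does not fit on a single GPU")
        for gpu in gpus:
            used = sum(m["gpu_gb"] for m in gpu)
            if used + model["gpu_gb"] <= capacity_gb:
                gpu.append(model)
                break
        else:
            # No existing GPU had room, so allocate a new one.
            gpus.append([model])
    return gpus


placement = pack_first_fit_decreasing(inventory, GPU_CAPACITY_GB)
for index, gpu in enumerate(placement):
    used = sum(m["gpu_gb"] for m in gpu)
    names = ", ".join(m["name"] for m in gpu)
    print(f"GPU {index}: {used}/{GPU_CAPACITY_GB} GB -> {names}")
print(f"{len(placement)} shared GPUs instead of {len(inventory)} dedicated ones")
```

Packing by memory alone is deliberately naive. A real co-location decision would also weigh each model's `sla_ms`, since a latency-sensitive model sharing a GPU with a heavy one can miss its target even when both fit in memory, and that is exactly the kind of policy the serving layer has to support.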