Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Three forces make enterprise serving hard, and they pull against each other. SLAs: a chat endpoint must return its first token in under 200 ms, while an offline embedding job can wait seconds. Mixed frameworks: a TensorRT-LLM engine and a PyTorch reranker must coexist on the same machine. GPU sharing: you can't afford one GPU per model, so several models must pack onto each device without one starving another. A serving platform has to meet latency targets, hide framework differences, and arbitrate a scarce shared GPU, all at the same time. Seeing how these three constraints interact is what tells you when you've outgrown a single process and need real orchestration.
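To make the SLA constraint concrete, here is a minimal sketch of checking a latency target against measured samples. It assumes the target is evaluated at a tail percentile (p99 here), which the lesson does not specify. The sample values and the `percentile` helper are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical time-to-first-token samples in milliseconds (invented data).
chat_ttft_ms = [120, 135, 140, 150, 155, 160, 170, 180, 190, 260]
SLA_MS = 200  # the chat first-token target from the text

p50 = percentile(chat_ttft_ms, 50)
p99 = percentile(chat_ttft_ms, 99)
print(f"p50={p50}ms p99={p99}ms target={SLA_MS}ms")
print("SLA met at p99" if p99 <= SLA_MS else "SLA violated at p99")
```

With these invented samples the median looks healthy while a single slow request breaks the target at the tail, which is why an average alone can hide an SLA problem.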
The demo packs several models onto a fixed GPU, checks whether they fit in memory, and flags the latency-critical models that still need protection to get enough of the device under load. That gap between fitting and performing is the core tension of co-location.
Use these three prompts in order. Each builds on the one before.
Explain how SLAs, mixed model frameworks, and GPU sharing each make enterprise model serving harder.
Walk me through what actually happens on a shared GPU when a latency-critical model and a bulk batch model run together, and why memory fitting alone isn't enough.
Given a latency-critical chat model and several bulk models on one GPU, how would I use instance counts, priorities, and rate limits to protect the chat SLA while maximizing utilization?
# Can these models share an 80GB GPU while the chat model keeps its SLA?
GPU_GB = 80
models = [
    {"name": "chat-llm", "gpu_gb": 42, "sla_ms": 200, "latency_critical": True},
    {"name": "embeddings", "gpu_gb": 2, "sla_ms": 30, "latency_critical": False},
    {"name": "reranker", "gpu_gb": 6, "sla_ms": 50, "latency_critical": False},
    {"name": "guard", "gpu_gb": 1, "sla_ms": 20, "latency_critical": True},
]
used = sum(m["gpu_gb"] for m in models)
fits = used <= GPU_GB
print(f"memory: {used}/{GPU_GB} GB -> {'fits' if fits else 'OOM'}")
# Memory fitting is necessary but NOT sufficient: under load, batch jobs can
# steal compute from the latency-critical chat model unless we set priorities.
critical = [m["name"] for m in models if m["latency_critical"]]
print(f"needs priority/rate-limit protection so SLA holds: {critical}")

python3 main.py
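The demo stops at naming which models need protection. As a rough sketch of what priority-based protection could look like, the block below grants latency-critical models their full compute demand first and scales the bulk models into whatever is left. The `demand` fractions and the `allocate` helper are hypothetical, and real GPUs do not divide compute this neatly, so treat this as a toy model of the idea, not a scheduler.

```python
# Hypothetical peak compute demand per model, as a fraction of one GPU.
# These numbers are invented for illustration; measure your own under load.
demand = {"chat-llm": 0.60, "embeddings": 0.30, "reranker": 0.25, "guard": 0.05}
critical = {"chat-llm", "guard"}

def allocate(demand, critical, capacity=1.0):
    """Serve latency-critical models first, then split the remainder
    among the bulk models in proportion to their demand."""
    grant = {}
    remaining = capacity
    for name in sorted(critical):
        grant[name] = min(demand[name], remaining)
        remaining -= grant[name]
    bulk = {n: d for n, d in demand.items() if n not in critical}
    bulk_total = sum(bulk.values())
    # Scale bulk demand down only when it exceeds what is left.
    scale = min(1.0, remaining / bulk_total) if bulk_total > 0 else 0.0
    for name, d in bulk.items():
        grant[name] = d * scale
    return grant

grant = allocate(demand, critical)
for name, share in grant.items():
    print(f"{name:<11} wants {demand[name]:.2f} gets {share:.2f}")
```

Total demand here is 1.20 GPUs, so something has to give. With this policy the critical models keep everything they asked for and the bulk models are throttled to roughly 64% of their demand, which is the trade the comment in the demo is pointing at.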