Capstok — learn by doing

Why this matters

Three forces make enterprise serving hard, and they pull against each other. SLAs differ by workload: a chat endpoint may have to return its first token in under 200 ms, while an offline embedding job can wait seconds. Mixed frameworks mean a TensorRT-LLM engine and a PyTorch reranker must coexist on the same machine. GPU sharing means you can't afford one GPU per model, so several models must pack onto each device without one starving another. A serving platform has to meet latency targets, hide framework differences, and arbitrate a scarce shared GPU, all at the same time. Seeing how these three constraints interact is what tells you when you've outgrown a single process and need real orchestration.

Demo

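The demo packs several models onto a fixed 80 GB GPU, checks whether they fit in memory, and flags the latency-critical models that still need priority or rate-limit protection. Fitting in memory is one question; whether the latency-sensitive model still gets enough of the device is another. That gap is the core tension of co-location.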

Try it yourself

Increase the chat model to 60GB and re-run. The total is then 69GB, which still fits, so keep raising it until the total exceeds 80GB; identify which models must move to another GPU once memory no longer fits.
Mark the embedding job as a bulk workload and reason about why it should be rate-limited or de-prioritized so it doesn't blow the chat SLA.
Compute leftover GPU memory and decide how many more copies (instances) of the small guard model you could add.
Describe what signal you'd watch to know that co-location is hurting the latency-critical model's p99.
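
One possible starting point for the memory exercises above is sketched here. The helper names (leftover_gb, extra_copies, place_first_fit) are hypothetical and not part of the demo, and the sketch assumes each model's gpu_gb is a fixed footprint. Real footprints can grow under load (an LLM's KV cache, for example), so treat the numbers as an upper bound on what fits.

```python
# Hypothetical helpers for the memory exercises; model sizes mirror the demo.
GPU_GB = 80
models = [
    {"name": "chat-llm",   "gpu_gb": 42},
    {"name": "embeddings", "gpu_gb": 2},
    {"name": "reranker",   "gpu_gb": 6},
    {"name": "guard",      "gpu_gb": 1},
]

def leftover_gb(models, gpu_gb=GPU_GB):
    """Memory left on one GPU after loading every model once."""
    return gpu_gb - sum(m["gpu_gb"] for m in models)

def extra_copies(models, name, gpu_gb=GPU_GB):
    """How many more instances of `name` fit in the leftover memory (memory only)."""
    size = next(m["gpu_gb"] for m in models if m["name"] == name)
    return max(leftover_gb(models, gpu_gb), 0) // size

def place_first_fit(models, gpu_gb=GPU_GB):
    """Greedy first-fit-decreasing: place the largest model first and
    open a new GPU only when no existing GPU has room."""
    gpus = []  # each entry: {"free": GB remaining, "models": [names]}
    for m in sorted(models, key=lambda m: m["gpu_gb"], reverse=True):
        if m["gpu_gb"] > gpu_gb:
            raise ValueError(f"{m['name']} does not fit on any single GPU")
        for g in gpus:
            if g["free"] >= m["gpu_gb"]:
                break
        else:
            g = {"free": gpu_gb, "models": []}
            gpus.append(g)
        g["free"] -= m["gpu_gb"]
        g["models"].append(m["name"])
    return gpus

print(leftover_gb(models))            # 80 - 51 = 29
print(extra_copies(models, "guard"))  # 29 // 1 = 29
```

The copy count is a memory-only answer. Each extra instance also competes for compute, so the number you can actually run next to the chat model may be much smaller.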

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain how SLAs, mixed model frameworks, and GPU sharing each make enterprise model serving harder.

2. Why it works (the mechanism)

Walk me through what actually happens on a shared GPU when a latency-critical model and a bulk batch model run together, and why memory fitting alone isn't enough.

3. Advanced — application & what's next

Given a latency-critical chat model and several bulk models on one GPU, how would I use instance counts, priorities, and rate limits to protect the chat SLA while maximizing utilization?


# Can these models share an 80GB GPU while the chat model keeps its SLA?
GPU_GB = 80
models = [
    {"name": "chat-llm",   "gpu_gb": 42, "sla_ms": 200, "latency_critical": True},
    {"name": "embeddings", "gpu_gb": 2,  "sla_ms": 30,  "latency_critical": False},
    {"name": "reranker",   "gpu_gb": 6,  "sla_ms": 50,  "latency_critical": False},
    {"name": "guard",      "gpu_gb": 1,  "sla_ms": 20,  "latency_critical": True},
]
used = sum(m["gpu_gb"] for m in models)
fits = used <= GPU_GB
print(f"memory: {used}/{GPU_GB} GB -> {'fits' if fits else 'OOM'}")
# Memory fitting is necessary but NOT sufficient: under load, bulk jobs can
# steal compute from the latency-critical models unless we set priorities.
critical = [m["name"] for m in models if m["latency_critical"]]
print(f"needs priority/rate-limit protection so SLA holds: {critical}")

Run: python3 main.py
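
The demo only names the models that need protection. As a rough illustration of the rate-limit idea, the sketch below puts a token bucket in front of bulk requests while latency-critical requests always pass. TokenBucket and admit are hypothetical names, not part of the demo or of any serving framework, and this only throttles admission at the request level; it does not enforce priorities on the GPU itself.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def admit(model, bulk_limiter):
    """Latency-critical requests always pass; bulk requests must win a token."""
    return model["latency_critical"] or bulk_limiter.allow()

# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
limiter = TokenBucket(rate=2, capacity=2, clock=lambda: now[0])
embeddings = {"name": "embeddings", "latency_critical": False}
chat = {"name": "chat-llm", "latency_critical": True}

print([admit(embeddings, limiter) for _ in range(3)])  # [True, True, False]
print(admit(chat, limiter))                            # True, even with the bucket empty
now[0] = 0.5                                           # half a second later: one token refilled
print(admit(embeddings, limiter))                      # True
```

Choosing rate and capacity is the real tuning work: set them from how much bulk traffic the GPU can absorb before the chat model's tail latency starts to climb.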