Capstok — learn by doing

Why this matters

What makes a serving platform tractable is a single abstraction: a model is a named, versioned unit that takes typed input tensors and returns typed output tensors over a standard protocol. Everything else (which framework runs it, whether requests are batched, how many copies exist on which GPU) is configuration behind that contract. This is what lets a client call 'chat-llm' and 'embeddings' identically without knowing that one runs on TensorRT-LLM and the other on ONNX Runtime. Internalizing it early makes the rest of this course click: Triton, NIM, and the KServe protocol are all elaborations of 'named model, typed tensors, standard protocol, framework hidden'.
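
To make "framework hidden" concrete, here is a minimal sketch of the server side of that contract. It assumes nothing about any real serving stack: fake_embedding_backend, fake_chat_backend, the BACKENDS table, and infer are hypothetical names invented for illustration, and the two "backends" are plain Python functions standing in for different runtimes.

```python
from typing import Callable, Dict, List

Tensors = Dict[str, List]  # tensor name -> values

# Two stand-in backends (hypothetical). A real server would wrap different
# runtimes here; plain functions keep the sketch runnable anywhere.
def fake_embedding_backend(inputs: Tensors) -> Tensors:
    # Pretend embedding: one 1-dimensional vector per input string.
    return {"embedding": [[float(len(t))] for t in inputs["text"]]}

def fake_chat_backend(inputs: Tensors) -> Tensors:
    # Pretend generation: return the first max_tokens words of each prompt.
    limit = inputs["max_tokens"][0]
    return {"text": [" ".join(p.split()[:limit]) for p in inputs["prompt"]]}

# Hypothetical server-side table: model name -> whatever actually runs it.
BACKENDS: Dict[str, Callable[[Tensors], Tensors]] = {
    "embeddings": fake_embedding_backend,
    "chat-llm": fake_chat_backend,
}

def infer(model_name: str, inputs: Tensors) -> Tensors:
    """The only entry point a client sees: a name in, named tensors out."""
    return BACKENDS[model_name](inputs)

emb = infer("embeddings", {"text": ["hello world"]})
chat = infer("chat-llm", {"prompt": ["Summarize this please now"], "max_tokens": [2]})
print(emb)
print(chat)
```

Swapping either function for a different implementation changes nothing in the two infer calls, which is the property the abstraction is meant to guarantee.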

Demo

The demo expresses the abstraction as a tiny interface: a request names a model and version and supplies named tensors; a response returns named output tensors. The same client shape works for any model.

Try it yourself

Add a 'reranker' request with inputs query and documents and confirm it fits the exact same InferRequest shape — the abstraction holds across model types.
Set model_version to a specific value and reason about how the server maps name+version to a concrete loaded instance.
List three things the caller deliberately does NOT specify (framework, batch size, GPU id) and explain why hiding each is what makes the platform manageable.
Sketch how a load balancer or gateway in front could route on model_name alone, given this contract.
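
If you want a starting point for the exercises above, the following is one possible sketch, not a reference solution. It repeats the InferRequest dataclass from the demo so that it runs on its own. The LOADED and UPSTREAMS tables, the instance and pool names, and the resolve and route helpers are all hypothetical, and "empty version means highest loaded version" is an assumption that mirrors the comment in the demo rather than the policy of any particular server.

```python
from dataclasses import dataclass, field

@dataclass
class InferRequest:
    model_name: str
    model_version: str = ""          # "" means "let the server choose"
    inputs: dict = field(default_factory=dict)

# Hypothetical server state: model name -> {version -> loaded instance}.
LOADED = {
    "reranker": {"1": "reranker-v1-instance", "2": "reranker-v2-instance"},
    "embeddings": {"3": "embeddings-v3-instance"},
}

# Hypothetical gateway table: model name -> upstream pool.
UPSTREAMS = {"reranker": "pool-a", "embeddings": "pool-b"}

def resolve(req: InferRequest) -> str:
    """Map name+version to a loaded instance (assumed policy: highest version)."""
    versions = LOADED[req.model_name]
    version = req.model_version or max(versions, key=int)
    return versions[version]

def route(req: InferRequest) -> str:
    """A gateway needs only model_name; it never inspects the tensors."""
    return UPSTREAMS[req.model_name]

# The reranker fits the same request shape as any other model.
req = InferRequest("reranker", inputs={"query": ["gpu serving"],
                                       "documents": ["doc a", "doc b"]})
pinned = InferRequest("reranker", model_version="1", inputs=req.inputs)

print(resolve(req), resolve(pinned), route(req))
```

Note what is absent from both requests: no framework, batch size, or GPU id appears anywhere, so the server can change any of them without the caller noticing.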

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain the inference-server abstraction (named versioned model, typed input/output tensors, standard protocol) as if I'm new to it.

2. Why it works (the mechanism)

Walk me through how a single request shape (model name, version, named tensors) lets one client talk to many different model frameworks identically.

3. Advanced — application & what's next

Given this abstraction, how would a serving platform support per-version routing, A/B tests, and framework swaps without changing any client code?

# main.py
# The inference-server contract, distilled. Every model looks the same to the caller.
from dataclasses import dataclass, field

@dataclass
class InferRequest:
    model_name: str
    model_version: str = ""          # "" means "latest ready version"
    inputs: dict = field(default_factory=dict)   # name -> tensor (list/array)

@dataclass
class InferResponse:
    model_name: str
    outputs: dict = field(default_factory=dict)  # name -> tensor

# Caller does NOT know the framework, batching, or GPU placement:
req_embed = InferRequest("embeddings", inputs={"text": ["hello world"]})
req_chat  = InferRequest("chat-llm",   inputs={"prompt": ["Summarize this."], "max_tokens": [128]})
# Same shape, different models. The server resolves name+version -> a loaded backend instance.
print(req_embed.model_name, "and", req_chat.model_name, "share one calling convention")

Run: python3 main.py