Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
The central idea that makes a serving platform tractable is a single abstraction: a model is a named, versioned thing that takes typed input tensors and returns typed output tensors over a standard protocol. Everything else — which framework runs it, whether it's batched, how many copies exist on which GPU — is configuration behind that contract. This abstraction is what lets a client call 'chat-llm' or 'embeddings' identically without knowing one is TensorRT-LLM and the other is ONNX. Internalizing it early is what makes the rest of this course click: Triton, NIM, and the KServe protocol are all elaborations of 'named model, typed tensors, standard protocol, framework hidden'.
The demo expresses the abstraction as a tiny interface: a request names a model and version and supplies named tensors; a response returns named output tensors. The same client shape works for any model.
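To make "typed tensors over a standard protocol" concrete, here is a minimal sketch of how the same two requests might look on the wire under the KServe v2 inference protocol. The helper `to_v2_request` is hypothetical, as is version `"2"`; the tensor names come from the demo, and the datatypes and 1-D shapes are assumptions about how these models might be configured.

```python
import json


def to_v2_request(model_name, inputs, model_version=""):
    """Build (path, body) for a KServe-v2-style infer call.

    inputs maps tensor name -> (datatype, flat list of values).
    Hypothetical helper: it only assembles the payload, it sends nothing.
    """
    path = f"/v2/models/{model_name}"
    if model_version:
        # Pinning a version adds a path segment; omitting it leaves the choice to the server.
        path += f"/versions/{model_version}"
    path += "/infer"
    body = {
        "inputs": [
            # Assumption: every tensor here is one-dimensional, so shape is [len(data)].
            {"name": name, "shape": [len(data)], "datatype": datatype, "data": data}
            for name, (datatype, data) in inputs.items()
        ]
    }
    return path, body


embed_path, embed_body = to_v2_request(
    "embeddings", {"text": ("BYTES", ["hello world"])}
)
chat_path, chat_body = to_v2_request(
    "chat-llm",
    {"prompt": ("BYTES", ["Summarize this."]), "max_tokens": ("INT32", [128])},
    model_version="2",  # hypothetical version, for illustration only
)

print(embed_path)
print(json.dumps(embed_body))
print(chat_path)
print(json.dumps(chat_body))
```

Only the model name, the optional version segment, and the tensor list differ between the two calls; nothing in either payload identifies the framework behind the model.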
Use these three prompts in order. Each builds on the one before.
Explain the inference-server abstraction (named versioned model, typed input/output tensors, standard protocol) as if I'm new to it.
Walk me through how a single request shape (model name, version, named tensors) lets one client talk to many different model frameworks identically.
Given this abstraction, how would a serving platform support per-version routing, A/B tests, and framework swaps without changing any client code?
# The inference-server contract, distilled. Every model looks the same to the caller.
from dataclasses import dataclass, field


@dataclass
class InferRequest:
    model_name: str
    model_version: str = ""  # "" means "latest ready version"
    inputs: dict = field(default_factory=dict)  # name -> tensor (list/array)


@dataclass
class InferResponse:
    model_name: str
    outputs: dict = field(default_factory=dict)  # name -> tensor

# Caller does NOT know the framework, batching, or GPU placement:
req_embed = InferRequest("embeddings", inputs={"text": ["hello world"]})
req_chat = InferRequest("chat-llm", inputs={"prompt": ["Summarize this."], "max_tokens": [128]})
# Same shape, different models. The server resolves name+version -> a loaded backend instance.
print(req_embed.model_name, "and", req_chat.model_name, "share one calling convention")
# Run with: python3 main.py
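The demo stops at the client side. As a rough sketch of the other half, the block below shows one way a server could resolve name and version to a loaded backend. `ModelRegistry`, `fake_embed_v1`, and `fake_embed_v2` are hypothetical stand-ins rather than any real server's API, and treating the empty version string as "highest loaded integer version" is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InferRequest:
    model_name: str
    model_version: str = ""  # "" means "latest ready version"
    inputs: dict = field(default_factory=dict)


@dataclass
class InferResponse:
    model_name: str
    outputs: dict = field(default_factory=dict)


# A backend is anything that maps named input tensors to named output tensors.
Backend = Callable[[dict], dict]


class ModelRegistry:
    """Hypothetical registry: model name -> {version -> backend}."""

    def __init__(self) -> None:
        self._models: Dict[str, Dict[int, Backend]] = {}

    def load(self, name: str, version: int, backend: Backend) -> None:
        self._models.setdefault(name, {})[version] = backend

    def infer(self, req: InferRequest) -> InferResponse:
        versions = self._models.get(req.model_name)
        if not versions:
            raise KeyError(f"unknown model: {req.model_name}")
        # Assumption: versions are integers and "latest" means the highest one loaded.
        version = int(req.model_version) if req.model_version else max(versions)
        if version not in versions:
            raise KeyError(f"{req.model_name} has no version {version}")
        return InferResponse(req.model_name, versions[version](req.inputs))


# Stand-in backends; a real deployment would wrap a framework runtime here.
def fake_embed_v1(inputs: dict) -> dict:
    return {"embedding": [[float(len(t))] for t in inputs["text"]]}


def fake_embed_v2(inputs: dict) -> dict:
    return {"embedding": [[float(len(t)), 1.0] for t in inputs["text"]]}


registry = ModelRegistry()
registry.load("embeddings", 1, fake_embed_v1)
registry.load("embeddings", 2, fake_embed_v2)

latest = registry.infer(InferRequest("embeddings", inputs={"text": ["hello world"]}))
pinned = registry.infer(InferRequest("embeddings", "1", inputs={"text": ["hello world"]}))
print(latest.outputs)  # served by version 2
print(pinned.outputs)  # served by version 1
```

Swapping the function registered under a name and version changes what runs without touching the request shape, which is the property the abstraction is meant to protect.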