How LLM serving actually works: the autoregressive forward pass, the KV cache, memory-vs-compute bounds, PagedAttention, continuous batching, and streaming — then vLLM internals, the serving-framework landscape, inference optimization, and capacity planning. Learn to serve LLMs fast and cheaply to an SLO.
A Python-first, hands-on course on what happens between your API call and the tokens that come back, and on how to serve LLMs efficiently in production. You'll run real models locally using actual vLLM and Hugging Face code, not pseudo-code: measure prefill vs. decode, compute KV-cache memory, place workloads on the roofline, and watch continuous batching and streaming inside the engine. The back half goes deep on 2026 practice: PagedAttention, continuous batching, speculative decoding, FP8 and KV-cache quantization, prefix/prompt caching, chunked prefill, and the vLLM/TGI/SGLang/llama.cpp/TensorRT-LLM landscape. You finish able to size GPUs for an SLO, compute cost-per-million-tokens, plan concurrency and autoscaling, and make a defensible build-vs-buy call.
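As a taste of the "compute KV-cache memory" exercise, here is a minimal sketch of the usual per-token arithmetic. The function names are hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumed Llama-3.1-8B-like configuration used only for illustration; read the real values from your model's config.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache stored per token.

    The leading 2 accounts for one key and one value vector
    per layer per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes, bytes_per_token):
    """How many tokens fit in the memory left over after weights (integer floor)."""
    return free_bytes // bytes_per_token


# Assumed shape (illustrative): 32 layers, 8 KV heads, head_dim 128, fp16/bf16.
per_token = kv_cache_bytes_per_token(32, 8, 128)

# Assumed budget (illustrative): 8 GiB of GPU memory left for the KV cache.
budget = 8 * 1024**3
print(per_token, max_cached_tokens(budget, per_token))
```

Dividing the token budget by your expected context length gives a rough upper bound on concurrent sequences; real engines reserve extra memory for activations and fragmentation, so treat the result as a ceiling rather than a guarantee.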
Built by Lakshya Kumar
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Stand up an OpenAI-compatible vLLM endpoint for a model of your choice, instrument TTFT/TPOT/throughput with percentiles, run a load test, and produce a capacity-and-cost report that meets a stated SLO. The report must justify GPU count, per-replica concurrency, autoscaling thresholds, and projected cost-per-million-tokens with measured numbers.
I'm taking a course on LLM inference and serving internals. Help me tailor it to my situation.

My model: [e.g. Llama 3.1 8B / Qwen2.5 7B / a 70B model]
My hardware: [e.g. one 24GB consumer GPU / an A100 80GB / 2 nodes x 4 H100 / CPU only]
My workload: [e.g. interactive chat / batch summarization / an agent with long shared system prompts]
My SLO: [e.g. p99 TTFT < 1s, p99 TPOT < 50ms / "throughput, latency doesn't matter"]
My priority: [throughput | latency | cost | data control]

For each lesson, ground the concept in my setup: compute my KV-cache size and max concurrency for my model+GPU, tell me whether I'm prefill- or decode-bound for my workload, recommend a serving framework and quantization stack for my priority, and walk me through sizing and cost-per-million-tokens for my SLO. When a choice depends on details I haven't given, ask me before assuming.
Build a benchmarking harness that drives any OpenAI-compatible endpoint with a configurable workload (prompt/output mix, concurrency sweep) and reports req/s, prefill/decode tok/s, and TTFT/TPOT percentiles, plus a throughput-vs-latency frontier plot. It must produce apples-to-apples comparisons across configs or engines.
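The latency metrics in this harness can be sketched from per-token arrival timestamps. This is a minimal illustration, not a reference implementation: the helper names are hypothetical, timestamps are assumed to be seconds from a monotonic clock, TPOT is taken as the mean gap between output tokens after the first, and the percentile uses the simple nearest-rank method.

```python
import math


def ttft(request_sent, token_times):
    """Time to first token: first token arrival minus request send time."""
    return token_times[0] - request_sent


def tpot(token_times):
    """Time per output token: mean gap between tokens after the first.

    Returns None when fewer than two tokens were received.
    """
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical timings for three streamed requests (seconds).
requests = [
    (0.0, [0.25, 0.30, 0.35]),
    (0.0, [0.40, 0.46, 0.52]),
    (0.0, [0.90, 0.95, 1.00]),
]
ttfts = [ttft(sent, times) for sent, times in requests]
print(percentile(ttfts, 50), percentile(ttfts, 99))
```

Keeping the percentile method fixed across runs matters more than which method you pick; mixing interpolated and nearest-rank percentiles is one way comparisons stop being apples-to-apples.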
Take one model and serve it under fp16 plus a stack of optimizations (weight quantization, KV-cache quantization, attention backend, chunked prefill), measuring decode tok/s, memory, and a task-relevant quality metric at each step. Deliver a Pareto report and a recommended stack that holds quality within a stated budget.
Benchmark at least two serving frameworks (e.g. vLLM and TGI, optionally SGLang or llama.cpp) on the same model and load test via the OpenAI-compatible API, comparing throughput, latency percentiles, and operational ergonomics. Deliver a recommendation for a stated workload with the runner-up and decision-flip conditions.
Build a capacity-planning model that turns an SLO and a traffic forecast into GPU count, per-replica concurrency, autoscaling thresholds, capacity mix, and projected cost-per-million-tokens, validated by a load test and including a build-vs-buy comparison against a hosted API at the forecast volume.
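The core arithmetic of such a model can be sketched in a few lines. Everything here is an assumption for illustration: the function names are hypothetical, the GPU price and per-GPU throughput are placeholder numbers you would replace with measured values from your load test, and `headroom` is an assumed target-utilization fraction that leaves slack for bursts.

```python
import math


def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second_per_gpu):
    """USD per one million tokens at a sustained per-GPU throughput."""
    tokens_per_hour = tokens_per_second_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def gpus_needed(peak_tokens_per_second, tokens_per_second_per_gpu, headroom=0.7):
    """GPU count so that peak traffic uses at most `headroom` of capacity."""
    usable = tokens_per_second_per_gpu * headroom
    return math.ceil(peak_tokens_per_second / usable)


# Placeholder inputs (not measurements): $2.00 per GPU-hour, 2,500 tok/s per GPU
# at the concurrency that still meets the SLO, and a 10,000 tok/s peak forecast.
print(round(cost_per_million_tokens(2.00, 2500), 4))
print(gpus_needed(10_000, 2500, headroom=0.7))
```

The throughput you plug in should be the rate at which the SLO still holds, not the maximum the engine can reach, since pushing concurrency past that point lowers cost per token while violating the latency targets. The same cost figure is what you would set against a hosted API's per-token price in the build-vs-buy comparison.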
The PagedAttention paper: the KV-cache memory management that underpins modern serving.