
LLM Inference & Serving Internals

How LLM serving actually works: the autoregressive forward pass, the KV cache, memory-vs-compute bounds, PagedAttention, continuous batching, and streaming — then vLLM internals, the serving-framework landscape, inference optimization, and capacity planning. Learn to serve LLMs fast and cheaply to an SLO.


Certificate: 1 of 5 capstones

A Python-first, hands-on course on what happens between your API call and the tokens that come back, and on how to serve LLMs efficiently in production. You'll run real models locally with real vLLM and Hugging Face code, not pseudo-code: measure prefill vs. decode, compute KV-cache memory, place workloads on the roofline, and watch continuous batching and streaming in the engine. The back half goes deep on 2026 practice: PagedAttention, continuous batching, speculative decoding, FP8 and KV-cache quantization, prefix/prompt caching, chunked prefill, and the vLLM/TGI/SGLang/llama.cpp/TensorRT-LLM landscape. You finish able to size GPUs for an SLO, compute cost-per-million-tokens, plan concurrency and autoscaling, and make a defensible build-vs-buy call.
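To give a flavor of the "compute KV-cache memory" exercise, here is a minimal sketch using the common per-token estimate: 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The function names and the 8 GiB cache budget are illustrative assumptions, not course material; the shape values are those of the published Llama 3.1 8B configuration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """How many tokens fit in a given KV-cache budget (ignores fragmentation)."""
    return kv_budget_bytes // per_token_bytes


# Llama 3.1 8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Hypothetical budget: 8 GiB left for the cache after weights and activations.
budget = 8 * 1024**3
tokens = max_cached_tokens(budget, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in 8 GiB: {tokens}")
print(f"Concurrent 4096-token sequences: {tokens // 4096}")
```

The estimate ignores allocator overhead and block granularity, so treat the result as an upper bound on what a real engine will hold.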

Built by Lakshya Kumar

Tags: llm, inference, serving, vllm, gpu, performance, optimization

Before you start (4 items)

Comfortable in Python; have called an LLM API.
Understand transformers/attention at a high level (or completed an LLM-from-scratch course).
Access to a GPU is helpful (Colab/Kaggle free tier works for most labs).
An API key for the few hosted-API demos (free tier fine).

Is this course for you? Ask an AI

Paste the prompt below into any AI chat. Fill in the bracketed parts with your context, and you'll get back a straight answer on whether this belongs on your plate.

Get access to LLM Inference & Serving Internals

$3.99

30-day access

Prefer the whole catalog? See all-access membership.

Ask for access

We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.

Capstone projects

Submit any 1 of 5 to earn the certificate

Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.

stand-up-and-size-a-vllm-endpoint: Stand up, instrument, and size a vLLM endpoint

Stand up an OpenAI-compatible vLLM endpoint for a model of your choice, instrument TTFT/TPOT/throughput with percentiles, run a load test, and produce a capacity-and-cost report that meets a stated SLO. The report must justify GPU count, per-replica concurrency, autoscaling thresholds, and projected cost-per-million-tokens with measured numbers.

Submit your endpoint + capacity-and-cost report
Minimum rating for approval: 3/5
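As a rough illustration of the measurements this capstone asks for, the sketch below derives TTFT (time to first token) and TPOT (time per output token) from per-token arrival timestamps, summarizes them with a nearest-rank percentile, and converts measured throughput into cost-per-million-tokens. All function names, the timestamps, and the $2.00/hour GPU price are hypothetical; a real report would use timings captured from your own load test.

```python
import math


def ttft(request_sent, token_times):
    """Seconds from sending the request to receiving the first token."""
    return token_times[0] - request_sent


def tpot(token_times):
    """Mean seconds between output tokens after the first; None if < 2 tokens."""
    n = len(token_times)
    if n < 2:
        return None
    return (token_times[-1] - token_times[0]) / (n - 1)


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_million_tokens(gpu_hourly_usd, n_gpus, tokens_per_second):
    """USD per one million tokens at a sustained aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000


# Hypothetical single request: sent at t=0.0 s, tokens arrive at these times.
arrivals = [0.25, 0.30, 0.35]
print(f"TTFT: {ttft(0.0, arrivals):.3f} s, TPOT: {tpot(arrivals):.3f} s")

# Hypothetical TTFT samples (seconds) from a load test.
samples = [0.21, 0.25, 0.24, 0.90, 0.23, 0.26, 0.22, 0.31, 0.28, 0.27]
print(f"p50 TTFT: {percentile(samples, 50):.2f} s")
print(f"p99 TTFT: {percentile(samples, 99):.2f} s")

# Hypothetical pricing: one GPU at $2.00/hour sustaining 2000 tok/s.
print(f"Cost: ${cost_per_million_tokens(2.00, 1, 2000):.3f} per 1M tokens")
```

Nearest-rank is only one of several percentile definitions; whichever you choose, state it in the report so numbers stay comparable across runs.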

build-a-serving-benchmark-harness: Build a reusable serving benchmark harness

Build a benchmarking harness that drives any OpenAI-compatible endpoint with a configurable workload (prompt/output mix, concurrency sweep) and reports req/s, prefill/decode tok/s, and TTFT/TPOT percentiles, plus a throughput-vs-latency frontier plot. It must produce apples-to-apples comparisons across configs or engines.

Submit your benchmark harness + sample report
Minimum rating for approval: 3/5

quantization-quality-speed-study: Quantization quality vs. speed study

Take one model and serve it under fp16 plus a stack of optimizations (weight quantization, KV-cache quantization, attention backend, chunked prefill), measuring decode tok/s, memory, and a task-relevant quality metric at each step. Deliver a Pareto report and a recommended stack that holds quality within a stated budget.

Submit your quality-vs-speed study
Minimum rating for approval: 3/5

serving-framework-bake-off: Serving-framework bake-off

Benchmark at least two serving frameworks (e.g. vLLM and TGI, optionally SGLang or llama.cpp) on the same model and load test via the OpenAI-compatible API, comparing throughput, latency percentiles, and operational ergonomics. Deliver a recommendation for a stated workload with the runner-up and decision-flip conditions.

Submit your framework bake-off report
Minimum rating for approval: 3/5

capacity-planning-model: Capacity-planning model for an SLO

Build a capacity-planning model that turns an SLO and a traffic forecast into GPU count, per-replica concurrency, autoscaling thresholds, capacity mix, and projected cost-per-million-tokens, validated by a load test and including a build-vs-buy comparison against a hosted API at the forecast volume.

Submit your capacity-planning model + validation
Minimum rating for approval: 3/5

Further reading & study material (6 sources)

The PagedAttention paper: the KV-cache memory management that underpins modern serving.

Prompt for "Is this course for you?"

I'm taking a course on LLM inference and serving internals. Help me tailor it to my situation.

My model: [e.g. Llama 3.1 8B / Qwen2.5 7B / a 70B model]
My hardware: [e.g. one 24GB consumer GPU / an A100 80GB / 2 nodes x 4 H100 / CPU only]
My workload: [e.g. interactive chat / batch summarization / an agent with long shared system prompts]
My SLO: [e.g. p99 TTFT < 1s, p99 TPOT < 50ms / "throughput, latency doesn't matter"]
My priority: [throughput | latency | cost | data control]

For each lesson, ground the concept in my setup: compute my KV-cache size and max concurrency for my model+GPU, tell me whether I'm prefill- or decode-bound for my workload, recommend a serving framework and quantization stack for my priority, and walk me through sizing and cost-per-million-tokens for my SLO. When a choice depends on details I haven't given, ask me before assuming.

LLM Inference & Serving Internals

The Inference Forward Pass

The KV Cache

Memory vs. Compute Bounds

PagedAttention & KV Cache Management

Batching Strategies

Streaming & Latency Metrics

vLLM Internals

Serving Frameworks Landscape

Inference Optimization for Serving

Capacity Planning & Cost