The engineering take on AI — context, RAG, agents, serving, and the production lifecycle.
The engineering path through applied AI. Manage the context window as a resource, build retrieval and production agents, then scale, serve, and operate them. Covers context engineering, RAG, production agents, inference and serving internals, enterprise serving stacks, and the full MLOps/LLMOps lifecycle. Python-first with runnable code, and three labelled prompts per task so your AI teaches the mechanism instead of hand-waving it.
Engineer the context window as a budgeted resource: select, order, compress, cache, and measure what the model sees. The discipline underneath every reliable RAG, agent, and long-session assistant.
Every level of RAG, from naive cosine similarity over embeddings to hybrid retrieval, reranking, multi-hop, agentic retrieval, and production DevOps. Each module is the next move when the previous one breaks.
Context engineering for agentic, multi-turn, multi-agent, and adversarial systems at scale: multi-agent context, dynamic assembly, hierarchical memory, KV-cache-aware design, tool-loop management, long-horizon memory, security, multi-modal, and a context platform.
End-to-end: tools, memory, planning, guardrails, evals, observability, cost, safety, shipping. Build an agent that survives real users.
From your first user to your millionth. Cost, latency, observability, queues, multi-model, vector DBs, on-call, deploys, compliance — at every scale milestone.
Ten end-to-end AI projects. Each module is a full build with scope, code, eval, and deployment. The projects: hybrid RAG, multi-modal, text-to-SQL, eval generation, semantic cache, document extraction, orchestration, browser agent, voice, and fine-tuning.
How LLM serving actually works: the autoregressive forward pass, the KV cache, memory-vs-compute bounds, PagedAttention, continuous batching, and streaming — then vLLM internals, the serving-framework landscape, inference optimization, and capacity planning. Learn to serve LLMs fast and cheaply to an SLO.
Serve a real fleet of models the way companies actually do it: NVIDIA Triton, TensorRT-LLM, and NIM, plus graph compilation, ensembles/BLS, multi-model GPU sharing, observability, and production deployment.
Run ML and LLM systems in production the way it's actually done: experiment tracking, model/prompt registries with lineage, CI/CD with eval gates, safe deployment patterns, automated LLM evals, observability, guardrails, and a continuous improvement loop. The discipline that turns a working demo into a system you can trust at scale.