Engineer the context window as a budgeted resource: select, order, compress, cache, and measure what the model sees. The discipline underneath every reliable RAG, agent, and long-session assistant.
Ten modules, ~100 challenges on the discipline that decides whether an LLM feature is reliable: what goes into the context window, in what order, how it's compressed and cached, and how you measure it. Goes beyond prompt engineering into the anatomy of the window (roles, tokens, position effects), retrieval as selection, three-tier memory, tool-result and structured context, compression, prompt + semantic caching, long-context strategy, and a full eval/regression discipline. Python-first with runnable code against real provider APIs (Anthropic + OpenAI), and built on 2026 production practice: prompt caching, citations, lost-in-the-middle, and LLM-as-judge evals done right.
Built by Lakshya Kumar
We grant free access case-by-case — students, career-switchers, builders on a tight budget. Sign in to send us a note.
Complete all modules, then submit the required number of capstone projects. Each must earn a passing rating from an admin reviewer.
Build a complete context pipeline for a real corpus and use case: gather → score → select-under-budget → order → format, with dedup, recency, retrieval + rerank, three-tier memory, compression with fidelity guarantees, and a layered cache. Ship it behind an API and prove quality with an eval harness (retrieval + generation metrics) and a CI regression gate. Submit the repo, the eval report, and a cost/latency breakdown.
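As a rough illustration of the score → select-under-budget → order → format steps named above, here is a minimal Python sketch. It is not the course's reference implementation. The `Chunk` type, the function names, the greedy selection rule, and the edge-first ordering heuristic are all assumptions made for this example, and scores and token counts are taken as already computed by your retriever and tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from retrieval/rerank (assumed given)
    tokens: int   # token count (assumed precomputed with the model's tokenizer)


def select_under_budget(chunks, budget_tokens):
    """Greedy selection: drop exact duplicates, then take the highest-scoring chunks that fit."""
    seen = set()
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        key = " ".join(chunk.text.lower().split())  # naive exact-duplicate key
        if key in seen:
            continue
        seen.add(key)
        if used + chunk.tokens > budget_tokens:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected, used


def order_for_position(selected):
    """Put the strongest chunks at the start and end, the weakest in the middle."""
    ranked = sorted(selected, key=lambda c: c.score, reverse=True)
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def format_context(ordered):
    """Number each chunk so the model (and your citations) can refer to it."""
    return "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(ordered, start=1))
```

Greedy selection is only one option; a knapsack-style selector or a diversity-aware one such as MMR would slot into the same place. Recency weighting, near-duplicate detection, and compression are left out to keep the sketch short.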
Paste this into any AI chat. Fill in the bracketed parts with your context — you'll get back a straight answer on whether this belongs on your plate.
I'm taking a "Context Engineering" course on engineering the LLM context window as a budgeted resource: what to include, in what order, how to compress and cache it, and how to measure it. It covers the anatomy of the window (roles, tokens, position effects), retrieval as selection, three-tier memory, tool results, compression, prompt + semantic caching, long-context strategy, and evals/regression gates. Python-first against Anthropic + OpenAI.

Here's my context:
1. What I'm building: [describe the feature/product]
2. My current context approach: [naive concatenation / basic RAG / agent / long-context stuffing]
3. Where it's failing: [inconsistent answers / too expensive / too slow / hallucinations / forgets things]
4. My model + window: [model and context size]

Given that, answer:
- Which module should I prioritize and why?
- Which of the five levers (select / order / compress / cache / measure) is most likely my bottleneck?
- Name 3 concrete changes I could make this week, and how I'd measure that each one helped.
- Name 1 thing this course won't fix so I have the right expectations.
Build a reusable eval pack another team could drop onto their RAG/agent: labeled-set tooling, retrieval + generation metrics, a calibrated LLM-as-judge, ablation runner, online-signal ingester, and a CI gate. Submit it as a small package or repo with docs.
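To make the retrieval-metrics and CI-gate parts of this brief concrete, here is a small sketch, offered as one possible shape and not a prescribed design. The function names, the `(query, relevant_ids)` layout of the labeled set, and the `tolerance` default are assumptions for illustration. The LLM-as-judge, ablation runner, and online-signal pieces are out of scope here.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(labeled_set, retrieve, k=5):
    """labeled_set: list of (query, relevant_ids); retrieve: query -> ranked doc ids."""
    if not labeled_set:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    recalls, rrs = [], []
    for query, relevant in labeled_set:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    n = len(labeled_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(rrs) / n}


def ci_gate(metrics, baseline, tolerance=0.02):
    """Return the names of metrics that fell more than `tolerance` below the baseline."""
    return [
        name
        for name, value in metrics.items()
        if value < baseline.get(name, 0.0) - tolerance
    ]
```

In CI, a non-empty list from `ci_gate` would fail the build. The tolerance exists so that ordinary run-to-run noise does not block merges; its value should come from the variance you actually measure.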
Ship a memory system (short-term + summarized + long-term) with salience extraction, relevance-gated recall, PII redaction, and cross-session continuity, integrated into a working assistant. Submit the live demo + a writeup of what it remembers and forgets and why.
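The sketch below illustrates two of the ideas in this brief, relevance-gated recall and PII redaction, on top of a bounded short-term buffer. Everything in it is an assumption made for the example: the `Memory` class, the word-overlap score (a stand-in for embedding similarity), the threshold value, and the email-only redaction rule. The summarized tier is omitted because it needs a model call.

```python
import re
from collections import deque

# Email-only redaction; a real system would cover more PII types.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text):
    """Replace email addresses with a placeholder before anything is stored."""
    return EMAIL_RE.sub("[EMAIL]", text)


def overlap(a, b):
    """Jaccard overlap of lowercase word sets; a stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa and wb else 0.0


class Memory:
    def __init__(self, short_term_size=4, recall_threshold=0.2):
        self.short_term = deque(maxlen=short_term_size)  # oldest turns fall off
        self.long_term = []                              # persisted facts
        self.recall_threshold = recall_threshold

    def add_turn(self, text):
        self.short_term.append(redact(text))

    def remember(self, fact):
        self.long_term.append(redact(fact))

    def recall(self, query, limit=3):
        """Return only the facts whose relevance to the query clears the threshold."""
        scored = [(overlap(query, fact), fact) for fact in self.long_term]
        kept = [pair for pair in scored if pair[0] >= self.recall_threshold]
        kept.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in kept[:limit]]
```

The gate is the point of the design: a fact that scores below the threshold never reaches the context window, so unrelated memories cost no tokens and cannot distract the model.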
Build prompt caching + semantic caching + embedding caching for a real workload with version-based invalidation, then report measured hit rate, cost savings, latency reduction, and false-hit rate over real traffic. Submit the implementation + the metrics dashboard.
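One way to approach version-based invalidation is to fold version identifiers into the cache key, so that bumping a version makes old entries unreachable without any explicit purge. The sketch below shows that idea for an exact-match response cache with hit-rate counters. The `VersionedCache` class, its method names, and the choice of prompt and corpus versions are illustrative assumptions, not a provider API. Because it matches keys exactly, it cannot produce false hits; measuring false-hit rate only applies once a semantic (similarity-based) layer is added.

```python
import hashlib
import json


class VersionedCache:
    def __init__(self, prompt_version, corpus_version):
        self.versions = {"prompt": prompt_version, "corpus": corpus_version}
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, query):
        # Versions are part of the key, so a version bump changes every key.
        payload = json.dumps(
            {"model": model, "query": query, **self.versions}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, model, query, compute):
        key = self._key(model, query)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = compute(query)
        self.store[key] = value
        return value

    def bump(self, **versions):
        """Invalidate by version, e.g. bump(corpus="2") after a re-index."""
        self.versions.update(versions)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Entries written under an old version stay in `store` until something evicts them, so a production version would pair this with a TTL or size-bounded eviction policy.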
Pick a real corpus and produce a rigorous decision report: needle-in-haystack recall map for your model, long-context and retrieval (and hybrid) implementations, and an accuracy/latency/cost comparison ending in a defensible recommendation. Submit the benchmark code + report.
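A needle-in-haystack recall map can be built by placing a known fact at varying depths in filler text of varying length and recording whether the model retrieves it. The sketch below is one minimal way to do that. The function names are hypothetical, `ask_model` is a caller-supplied function wrapping whatever provider you use, and substring matching is a deliberately crude scoring rule.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler_sentences))
    parts = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(parts)


def recall_map(ask_model, filler_sentences, needle, question, expected, lengths, depths):
    """Return {(n_sentences, depth): bool} marking whether the needle was recalled.

    ask_model(context, question) -> str is supplied by the caller.
    """
    results = {}
    for n in lengths:
        filler = filler_sentences[:n]
        for depth in depths:
            context = build_haystack(filler, needle, depth)
            answer = ask_model(context, question)
            # Crude check: the expected string appears somewhere in the answer.
            results[(n, depth)] = expected.lower() in answer.lower()
    return results
```

For a real report, each cell should be repeated with several different needles and filler samples so it holds a recall rate, not a single pass/fail, and length should be measured in tokens instead of sentences.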
The position-effects paper behind Modules 2 and 9. Required reading.