Capstok — learn by doing

Why this matters

Every model has a fixed context window measured in tokens, and every token you spend on context is a token you can't spend on output — and one you pay for, and one that adds latency. The instant you think of the window as a budget rather than 'big enough', the right questions appear: what's the highest-value thing to include per token? What can I drop? When do I summarize instead of paste? Engineers who skip this step build features that work in the demo with three documents and fall over in production with three hundred.

Demo

Budgeting is arithmetic. Pick your model's window (e.g. 200K), reserve room for output, subtract your fixed overhead (system prompt, tool definitions), and what's left is the variable budget you allocate across history + retrieved context. The helper below makes the budget explicit so you stop guessing.

Loading animation…

Try it yourself

Run it for a 200K model and an 8K model. The 8K budget is brutal — that's why local/open models force harder context decisions.
Bump RESERVED_OUTPUT to 16,000 (long answers) and watch the variable budget shrink. Output reservation is a real cost.
Change the 30/70 history/docs split to 70/30 and reason about which feature each split suits (chat assistant vs. document Q&A).
Add a second tool with a big JSON schema and re-measure TOOLS_OVERHEAD — tool definitions are context too.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain why the context window is best treated as a 'token budget'. What are the four costs of every token I add to a request?

2. Why it works (the mechanism)

Walk me through how to compute the variable context budget for a model call: window size, output reservation, fixed overhead (system + tools), and what's left for history + retrieval.

3. Advanced — application & what's next

I have a chat assistant on an 8K-context open model. Given the tiny budget, design a per-turn allocation strategy across system prompt, recent history, and retrieved context, and say what I'd cut first under pressure.

References

Chat about this lesson

import anthropic
client = anthropic.Anthropic()

def count(text: str, model="claude-sonnet-4-6") -> int:
    # Use the provider's real tokenizer, not len(text)//4
    r = client.messages.count_tokens(
        model=model, messages=[{"role": "user", "content": text}]
    )
    return r.input_tokens

WINDOW = 200_000
RESERVED_OUTPUT = 4_000
SYSTEM = "You are a support assistant. Cite sources by [id]."
TOOLS_OVERHEAD = 1_200  # tool schemas, roughly

fixed = count(SYSTEM) + TOOLS_OVERHEAD
variable_budget = WINDOW - RESERVED_OUTPUT - fixed
print(f"variable budget for history + retrieval: {variable_budget:,} tokens")

# Allocate it: e.g. 30% to conversation history, 70% to retrieved docs
hist_budget = int(variable_budget * 0.30)
docs_budget = variable_budget - hist_budget
print(f"history: {hist_budget:,}  docs: {docs_budget:,}")

Run: python3 main.py

The context window as a budget