Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Every model has a fixed context window measured in tokens, and every token you spend on context is a token you can't spend on output — and one you pay for, and one that adds latency. The instant you think of the window as a budget rather than 'big enough', the right questions appear: what's the highest-value thing to include per token? What can I drop? When do I summarize instead of paste? Engineers who skip this step build features that work in the demo with three documents and fall over in production with three hundred.
Budgeting is arithmetic. Pick your model's window (e.g. 200K), reserve room for output, subtract your fixed overhead (system prompt, tool definitions), and what's left is the variable budget you allocate across history + retrieved context. The helper below makes the budget explicit so you stop guessing.
Use these three in order. Each builds on the one before.
Explain why the context window is best treated as a 'token budget'. What are the four costs of every token I add to a request?
Walk me through how to compute the variable context budget for a model call: window size, output reservation, fixed overhead (system + tools), and what's left for history + retrieval.
I have a chat assistant on an 8K-context open model. Given the tiny budget, design a per-turn allocation strategy across system prompt, recent history, and retrieved context, and say what I'd cut first under pressure.
import anthropic
client = anthropic.Anthropic()
def count(text: str, model="claude-sonnet-4-6") -> int:
# Use the provider's real tokenizer, not len(text)//4
r = client.messages.count_tokens(
model=model, messages=[{"role": "user", "content": text}]
)
return r.input_tokens
WINDOW = 200_000
RESERVED_OUTPUT = 4_000
SYSTEM = "You are a support assistant. Cite sources by [id]."
TOOLS_OVERHEAD = 1_200 # tool schemas, roughly
fixed = count(SYSTEM) + TOOLS_OVERHEAD
variable_budget = WINDOW - RESERVED_OUTPUT - fixed
print(f"variable budget for history + retrieval: {variable_budget:,} tokens")
# Allocate it: e.g. 30% to conversation history, 70% to retrieved docs
hist_budget = int(variable_budget * 0.30)
docs_budget = variable_budget - hist_budget
print(f"history: {hist_budget:,} docs: {docs_budget:,}")python3 main.py