Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Your project's success criteria from task 1 only matter if you can MEASURE them. Build the eval harness: load the golden set, run each query end-to-end, capture retrieved chunks + answer + timing + cost, compute metrics. Run after every change. This IS your spec.
Four metrics: (1) recall@5: what fraction of the gold chunks appear in the top 5 retrieved? (2) Answer accuracy: an LLM judge answers 'does the answer contain the expected facts?'. (3) p99 latency, measured end-to-end. (4) Cost per query, computed from API usage. Aggregate the metrics after each change and compare them to the baseline. A CI step blocks PRs that regress any metric by more than 5%.
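The CI gate described above could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the function name `find_regressions`, the `HIGHER_IS_BETTER` table, and the sample numbers are all made up for the example, and it assumes "regress > 5%" means a relative change against the baseline (it could equally be read as percentage points).

```python
# Hypothetical regression gate: compares aggregated metrics to a stored baseline.
# Metric names follow the harness output; direction of "better" is stated per metric.
HIGHER_IS_BETTER = {
    "recall_at_5": True,
    "accuracy": True,
    "p99_latency_s": False,
    "avg_cost_usd": False,
}

def find_regressions(baseline, current, tolerance=0.05):
    """Return {metric: relative_worsening} for metrics worse than baseline by more than tolerance."""
    regressions = {}
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        base, cur = baseline[name], current[name]
        if base == 0:
            # Relative change is undefined; flag any worsening at all.
            worse = cur < base if higher_is_better else cur > base
            if worse:
                regressions[name] = float("inf")
            continue
        change = (cur - base) / abs(base)
        worsening = -change if higher_is_better else change
        if worsening > tolerance:
            regressions[name] = worsening
    return regressions

# Illustrative numbers only.
baseline = {"recall_at_5": 0.80, "accuracy": 0.75, "p99_latency_s": 2.0, "avg_cost_usd": 0.010}
current = {"recall_at_5": 0.82, "accuracy": 0.70, "p99_latency_s": 2.05, "avg_cost_usd": 0.010}
regressions = find_regressions(baseline, current)
print(regressions)
```

With these sample numbers only accuracy is flagged, since it drops by roughly 6.7% relative to the baseline while latency worsens by 2.5%. A CI job could exit non-zero whenever the returned dict is non-empty.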
Use these three prompts in order. Each builds on the one before.
What 4 metrics matter for RAG eval? Why each?
Walk me through LLM-judge for accuracy. Biases to watch?
Design a per-doc-type slice of your eval. Why might recall be different for markdown vs PDF?
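As a starting point for the per-doc-type slice, one possible shape is to tag each per-case result with its document type and average the metrics per group. The `doc_type` field and the `slice_by` helper are assumptions for illustration; the harness in this lesson does not record a document type, so each golden case would need to carry one.

```python
from collections import defaultdict

def slice_by(results, key="doc_type"):
    """Group per-case results by a field and average recall and accuracy per group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.get(key, "unknown")].append(r)
    return {
        name: {
            "n": len(rows),
            "recall_at_5": sum(r["recall_at_5"] for r in rows) / len(rows),
            "accuracy": sum(r["accuracy"] for r in rows) / len(rows),
        }
        for name, rows in groups.items()
    }

# Made-up per-case results, purely to show the output shape.
results = [
    {"doc_type": "markdown", "recall_at_5": 1.0, "accuracy": 1},
    {"doc_type": "markdown", "recall_at_5": 0.5, "accuracy": 1},
    {"doc_type": "pdf", "recall_at_5": 0.0, "accuracy": 0},
]
slices = slice_by(results)
print(slices)
```

Reporting `n` alongside each average helps avoid over-reading a slice that contains only a handful of cases.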
import json, time
from anthropic import Anthropic
client = Anthropic()
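def load_golden(path="evals/golden.jsonl"):
    # One JSON object per line; blank lines are skipped so a trailing newline does not crash the load.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
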
# answer_fn and estimate_cost are not defined in this snippet; they are assumed to come from elsewhere in the project.
def eval_one(case, search_fn):
    # Time retrieval and generation together so the latency is end-to-end.
    t0 = time.perf_counter()
    chunks = search_fn(case["question"])
    answer = answer_fn(case["question"], chunks)
    duration_s = time.perf_counter() - t0

    # recall@5: fraction of the expected chunks found in the top 5 results.
    retrieved_ids = {c["id"] for c in chunks[:5]}
    expected_ids = set(case["expected_chunk_ids"])
    recall_at_5 = len(retrieved_ids & expected_ids) / max(len(expected_ids), 1)

    # LLM-judge for answer accuracy
    judge = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=10,
        messages=[{"role": "user", "content": f"Does this answer contain the expected fact? Yes or No.\n\nAnswer: {answer['text']}\n\nExpected: {case['expected_answer']}"}],
    )
    # Check how the verdict starts, so a stray "yes" inside a longer reply does not count as a pass.
    accuracy = 1 if judge.content[0].text.strip().lower().startswith("yes") else 0

    return {
        "case_id": case["id"],
        "recall_at_5": recall_at_5,
        "accuracy": accuracy,
        "latency_s": duration_s,
        "cost_usd": estimate_cost(...),
    }

def run_eval():
    cases = load_golden()
    # hybrid_then_rerank is the retrieval function under test, assumed to be defined elsewhere in the project.
    results = [eval_one(c, hybrid_then_rerank) for c in cases]
    latencies = sorted(r["latency_s"] for r in results)
    return {
        "recall_at_5": sum(r["recall_at_5"] for r in results) / len(results),
        "accuracy": sum(r["accuracy"] for r in results) / len(results),
        "p99_latency_s": latencies[int(0.99 * len(latencies))],
        "avg_cost_usd": sum(r["cost_usd"] for r in results) / len(results),
    }

python3 main.py
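The harness reads four fields from each golden case: `id`, `question`, `expected_chunk_ids`, and `expected_answer`. A small check like the one below could catch malformed cases before a run starts. It is a sketch under assumptions: `validate_golden_lines` and `REQUIRED_FIELDS` are hypothetical names, and the two sample records are invented to show the expected shape of a line in `evals/golden.jsonl`.

```python
import json

# Fields that eval_one reads from each golden case.
REQUIRED_FIELDS = ("id", "question", "expected_chunk_ids", "expected_answer")

def validate_golden_lines(lines):
    """Parse JSONL lines; return (valid_cases, errors), where errors name the line and missing fields."""
    cases, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in case]
        if missing:
            errors.append(f"line {lineno}: missing {', '.join(missing)}")
        else:
            cases.append(case)
    return cases, errors

# Invented sample records: the first is complete, the second lacks two fields.
sample = [
    '{"id": "q1", "question": "What is the refund window?", "expected_chunk_ids": ["doc3#2"], "expected_answer": "30 days"}',
    '{"id": "q2", "question": "Who approves expenses?"}',
]
cases, errors = validate_golden_lines(sample)
print(len(cases), errors)
```

Passing an open file handle works the same way as passing a list, since both yield one line at a time.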