Eval harness — recall@k, faithfulness, latency, cost

hard

Why this matters

Your project's success criteria from task 1 only matter if you can MEASURE them. Build the eval harness: load the golden set, run each query end-to-end, capture retrieved chunks + answer + timing + cost, compute metrics. Run after every change. This IS your spec.
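The loop described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the golden set is assumed to be JSONL with the hypothetical fields `query`, `gold_chunk_id` and `expected_facts`, and `pipeline` stands for whatever callable wraps your own RAG system and reports its own per-query cost.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class GoldenCase:
    query: str
    gold_chunk_id: str          # assumed: one gold chunk per query
    expected_facts: list[str]


@dataclass
class PipelineOutput:
    retrieved_ids: list[str]    # chunk ids in ranked order
    answer: str
    cost_usd: float             # assumed: derived from API usage by the pipeline


@dataclass
class EvalRecord:
    query: str
    gold_chunk_id: str
    expected_facts: list[str]
    retrieved_ids: list[str]
    answer: str
    latency_s: float
    cost_usd: float


def load_golden_set(lines: Iterable[str]) -> list[GoldenCase]:
    """Parse one JSON object per non-empty line (works on an open file or a list)."""
    return [GoldenCase(**json.loads(line)) for line in lines if line.strip()]


def run_harness(
    cases: list[GoldenCase],
    pipeline: Callable[[str], PipelineOutput],
) -> list[EvalRecord]:
    """Run every query end-to-end and capture chunks, answer, timing and cost."""
    records = []
    for case in cases:
        start = time.perf_counter()
        out = pipeline(case.query)
        latency_s = time.perf_counter() - start
        records.append(
            EvalRecord(
                query=case.query,
                gold_chunk_id=case.gold_chunk_id,
                expected_facts=case.expected_facts,
                retrieved_ids=out.retrieved_ids,
                answer=out.answer,
                latency_s=latency_s,
                cost_usd=out.cost_usd,
            )
        )
    return records


# Stand-in pipeline so the sketch runs without any external service.
def fake_pipeline(query: str) -> PipelineOutput:
    return PipelineOutput(["c7", "c2"], "Refunds take 5 days.", 0.0004)


golden = load_golden_set(
    ['{"query": "How long do refunds take?", "gold_chunk_id": "c2", "expected_facts": ["5 days"]}']
)
records = run_harness(golden, fake_pipeline)
```

Passing the pipeline in as a callable keeps the harness independent of any particular retriever or model, so the same loop can be rerun unchanged after every code change.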

Demo

Four metrics: (1) recall@5: did the gold chunk appear in the top 5 retrieved chunks? (2) Answer accuracy: an LLM judge answers the question 'does the answer contain the expected facts?'. (3) p99 latency, measured end-to-end. (4) Cost per query, computed from API usage. Aggregate the metrics per change and compare them to the baseline. A CI step blocks PRs that regress any metric by more than 5%.
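To make the four numbers concrete, here is one way they might be aggregated from per-query records. It is a sketch under stated assumptions: the record keys are hypothetical, p99 uses the nearest-rank method (one of several percentile definitions), and `contains_expected_facts` is a plain substring check standing in for the LLM judge, which you would swap for a real model call.

```python
import math


def recall_at_k(retrieved_ids: list[str], gold_id: str, k: int = 5) -> bool:
    """True if the gold chunk appears among the top-k retrieved chunks."""
    return gold_id in retrieved_ids[:k]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def contains_expected_facts(answer: str, expected_facts: list[str]) -> bool:
    """Placeholder judge (NOT an LLM): case-insensitive substring match on every fact."""
    lowered = answer.lower()
    return all(fact.lower() in lowered for fact in expected_facts)


def summarize(records: list[dict], k: int = 5, judge=contains_expected_facts) -> dict:
    """Aggregate per-query records into the four headline metrics."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    hits = sum(1 for r in records if recall_at_k(r["retrieved_ids"], r["gold_chunk_id"], k))
    correct = sum(1 for r in records if judge(r["answer"], r["expected_facts"]))
    return {
        f"recall@{k}": hits / n,
        "accuracy": correct / n,
        "p99_latency_s": percentile([r["latency_s"] for r in records], 99),
        "cost_per_query_usd": sum(r["cost_usd"] for r in records) / n,
    }
```

With a small golden set, p99 under nearest-rank is simply the slowest query, so the latency number only becomes meaningful once the set has a few hundred cases.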

Try it yourself

Build the eval harness. Run after every code change.
Set the CI gate: PRs that regress recall@5 OR accuracy by > 5% are blocked.
Promote production traces to the golden set monthly.
Track metrics over time in a chart. Trend matters more than any single measurement.
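The CI gate from the list above could look something like this. It is a sketch, not a prescribed implementation: it assumes "regress by > 5%" means a relative change against the baseline, and the metric names and the `DIRECTIONS` map are hypothetical. To gate only on recall@5 and accuracy, pass a smaller map.

```python
# Assumed metric names; True means higher is better, False means lower is better.
DIRECTIONS = {
    "recall@5": True,
    "accuracy": True,
    "p99_latency_s": False,
    "cost_per_query_usd": False,
}


def find_regressions(
    baseline: dict,
    current: dict,
    tolerance: float = 0.05,
    directions: dict = DIRECTIONS,
) -> dict:
    """Return {metric: fraction it got worse by} for metrics past the tolerance."""
    regressions = {}
    for name, higher_is_better in directions.items():
        base, cur = baseline[name], current[name]
        if base == 0:
            continue  # relative change is undefined against a zero baseline
        change = (cur - base) / base
        worse_by = -change if higher_is_better else change
        if worse_by > tolerance:
            regressions[name] = worse_by
    return regressions


def gate(baseline: dict, current: dict, tolerance: float = 0.05) -> int:
    """Print any regressions and return a process exit code (0 = pass, 1 = block)."""
    regressions = find_regressions(baseline, current, tolerance)
    for name, worse_by in regressions.items():
        print(f"REGRESSION {name}: {worse_by:.1%} worse than baseline")
    return 1 if regressions else 0
```

In a CI script you would load the stored baseline and the fresh metrics, then call `sys.exit(gate(baseline, current))` so a non-zero exit code fails the job. Appending each run's metrics to a file as well gives you the history needed for the trend chart.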

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

What 4 metrics matter for RAG eval? Why each?

2. Why it works (the mechanism)

Walk me through LLM-judge for accuracy. Biases to watch?

3. Advanced — application & what's next

Design a per-doc-type slice of your eval. Why might recall be different for markdown vs PDF?