Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Naive RAG is the floor, not the ceiling, and the floor has predictable holes. (1) Chunk-boundary split: chunking splits a coherent answer across two chunks, only one is retrieved, and the answer is incomplete. (2) Query mismatch: the user types 'cancel sub' while the corpus says 'how to terminate your subscription'; dense embeddings narrow that vocabulary gap but do not close it. (3) Multi-hop: the answer requires combining chunk A (about X) with chunk B (about Y), but top-k retrieval against a single query tends to return only one of them. (4) Hallucinated paraphrase: the chunk says A but the model says something A-ish. Knowing these failure modes by name makes the rest of the course's techniques make sense, because each module fixes one of them.
Signals to instrument in production: log every retrieval's top-1 similarity score (a low score signals a potential failure), log the cosine-similarity gap between top-1 and top-5 (a small gap signals ambiguity), and log model refusals ('I don't have...') separately from hallucinations. A pattern to expect: most query failures cluster around the same 5-10 corpus gaps. Find them in the logs, write better chunks for those gaps, and eval results improve dramatically.
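Failure mode (1) is easy to reproduce on a toy input. The sketch below assumes a naive fixed-size character chunker; `chunk_fixed`, the sample document, and the chunk size of 60 are all hypothetical and not taken from this course's code. It shows the two steps of one answer landing in different chunks, so retrieving either chunk alone yields an incomplete answer.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Made-up document: the full answer needs both sentences.
doc = (
    "To cancel your subscription, open Settings and choose Billing. "
    "Then click Cancel plan and confirm within 24 hours."
)

chunks = chunk_fixed(doc, size=60)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))

# Which chunk(s) hold each step of the answer?
answer_steps = ["open Settings", "click Cancel plan"]
holders = {
    step: [i for i, chunk in enumerate(chunks) if step in chunk]
    for step in answer_steps
}
print(holders)

# If the steps live in different chunks, a single retrieved chunk cannot
# contain the whole answer.
chunks_needed = {i for ids in holders.values() for i in ids}
print("chunks needed for a complete answer:", sorted(chunks_needed))
```

With these inputs the boundary also cuts through the word "Billing", which is typical of chunkers that ignore sentence boundaries.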
Use these three prompts in order. Each builds on the one before.
Name 4 failure modes of naive RAG. For each, what's the symptom and what technique elsewhere in this course fixes it?
Walk me through the math behind 'top-1 similarity score is a signal'. Why does low top-1 correlate with bad answers, and what does a small top-1-to-top-5 gap tell you?
I have a corpus where 30% of user queries come back below the similarity threshold. Help me triage: how many are corpus gaps (need new content), how many are query-vs-doc mismatches (need query rewriting), and how many are multi-hop (need decomposition)?
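# Instrument every RAG call.
# search_with_threshold and answer_from_chunks are assumed to be defined elsewhere.
import json
import logging

# Without this, the root logger drops INFO records (its default level is WARNING).
logging.basicConfig(level=logging.INFO)

def answer_with_telemetry(question: str) -> str:
    # hits: entries above the similarity threshold, best first; hits[i][1] is the score
    hits = search_with_threshold(question, k=5)
    top_sim = hits[0][1] if hits else 0.0
    # Gap between the best and the worst returned hit (top-1 vs top-5 when all 5 pass)
    gap = (hits[0][1] - hits[-1][1]) if len(hits) > 1 else 0.0
    if not hits:
        # Nothing cleared the threshold: refuse without calling the model
        outcome = "refused_low_similarity"
        ans = "I don't have information about that."
    else:
        ans = answer_from_chunks(question, hits)
        # The model may still refuse in its answer; log that case separately
        outcome = "refused_in_answer" if "don't have" in ans.lower() else "answered"
    logging.info(json.dumps({
        "event": "rag_query",
        "q": question,
        "top_sim": float(top_sim),
        "gap": float(gap),
        "n_above_threshold": len(hits),
        "outcome": outcome,
    }))
    return ans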
# audit weekly:
# - what % of queries are below threshold? (refusal rate)
# - what's the distribution of top-1 sim? (corpus coverage signal)
# - which user queries cluster at low sim? (corpus gaps to fill)
# - which queries show small gap? (ambiguity — multi-hop candidates)
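The weekly audit questions above can be sketched as a small aggregation over the logged JSON records. This is a minimal sketch under stated assumptions: the log lines have already been collected as JSON strings in the shape emitted by `answer_with_telemetry`, the `audit` helper and its `small_gap=0.05` cutoff are hypothetical, and the sample records are made up for illustration.

```python
import json
from statistics import median


def audit(log_lines, small_gap=0.05, worst_n=2):
    """Summarize rag_query records: refusal rate, top-1 sim, weak and ambiguous queries."""
    records = [json.loads(line) for line in log_lines]
    records = [r for r in records if r.get("event") == "rag_query"]
    if not records:
        return {"n": 0}
    n = len(records)
    refused = sum(r["outcome"] == "refused_low_similarity" for r in records)
    by_sim = sorted(records, key=lambda r: r["top_sim"])
    return {
        "n": n,
        # share of queries where nothing cleared the threshold
        "refusal_rate": refused / n,
        # one-number view of the top-1 similarity distribution
        "median_top_sim": median(r["top_sim"] for r in records),
        # lowest-similarity queries: candidates for corpus gaps
        "lowest_sim_queries": [r["q"] for r in by_sim[:worst_n]],
        # several hits with nearly equal scores: ambiguity / multi-hop candidates
        "small_gap_queries": [
            r["q"] for r in records
            if r["n_above_threshold"] > 1 and r["gap"] < small_gap
        ],
    }


# Made-up sample records in the logged shape (hypothetical data).
sample_logs = [json.dumps(r) for r in [
    {"event": "rag_query", "q": "cancel sub", "top_sim": 0.0, "gap": 0.0,
     "n_above_threshold": 0, "outcome": "refused_low_similarity"},
    {"event": "rag_query", "q": "refund policy", "top_sim": 0.82, "gap": 0.30,
     "n_above_threshold": 5, "outcome": "answered"},
    {"event": "rag_query", "q": "pricing and limits", "top_sim": 0.61, "gap": 0.02,
     "n_above_threshold": 5, "outcome": "answered"},
    {"event": "rag_query", "q": "export data", "top_sim": 0.74, "gap": 0.20,
     "n_above_threshold": 3, "outcome": "refused_in_answer"},
]]

report = audit(sample_logs)
print(json.dumps(report, indent=2))
```

A median is only one summary of the top-1 distribution; a histogram or percentiles would serve the same audit question if more detail is needed.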
# Categorize failures.
# search and docs_to_id are assumed to be defined elsewhere.
def categorize_failure(case):
    hits = search(case["q"], k=20)
    ids = [docs_to_id(h) for h in hits]  # ranked chunk ids, best first
    if case["expected_chunk_id"] in ids[:5]:
        return "retrieved_but_answer_wrong"  # prompt or model problem
    if case["expected_chunk_id"] in ids:
        # in the top 20 but not the top 5
        return "near_miss_reranking_would_help"
    return "miss_chunking_or_embedding_problem"

python3 main.py
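Once each failed eval case has a category, a tally shows where to spend effort first. The sketch below is an assumption-laden illustration: `tally_failures` is a hypothetical helper, and `categorize_by_rank` is a stand-in for `categorize_failure` that reads a precomputed rank instead of calling a real search, so the block runs on its own with made-up cases.

```python
from collections import Counter


def tally_failures(cases, categorize):
    """Count how many failed eval cases fall into each category."""
    return Counter(categorize(case) for case in cases)


def categorize_by_rank(case):
    """Stand-in categorizer: 'rank' is the 1-based position of the expected
    chunk in the top 20, or None if it was not retrieved at all."""
    rank = case["rank"]
    if rank is not None and rank <= 5:
        return "retrieved_but_answer_wrong"
    if rank is not None:
        return "near_miss_reranking_would_help"
    return "miss_chunking_or_embedding_problem"


# Made-up failed cases (hypothetical data).
failed_cases = [
    {"q": "cancel sub", "rank": 12},
    {"q": "refund window", "rank": 2},
    {"q": "export my data", "rank": None},
    {"q": "team seat limits", "rank": 9},
]

counts = tally_failures(failed_cases, categorize_by_rank)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

In real use, passing `categorize_failure` as the `categorize` argument would replace the stand-in.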