Capstok — learn by doing

Why this matters

Naive RAG is the floor, not the ceiling, and the floor has predictable holes. (1) Chunk split: chunking divides a coherent answer across two chunks, only one is retrieved, and the answer comes back incomplete. (2) Query mismatch: the user types 'cancel sub' while the corpus says 'how to terminate your subscription'; dense embeddings narrow this vocabulary gap but are not magic. (3) Multi-hop: the answer requires combining chunk A (about X) with chunk B (about Y), but a top-k search against a single query tends to retrieve only one of them. (4) Hallucinated paraphrase: the chunk says A but the model says something A-ish. Knowing these failure modes by name makes the rest of this course's techniques make sense, since each module fixes one of them.
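Failure mode (1) is easy to reproduce on a toy document. The sketch below is illustrative only: the document text, the 8-word chunk size, and the helper names chunk_by_words and holds_full_answer are made up for this example and are not part of the course code.

```python
# Hypothetical illustration of failure mode (1): a fixed-size chunker
# cuts a two-part fact in half, so no single chunk holds the full answer.

def chunk_by_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` words, with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def holds_full_answer(chunk: str, required: list[str]) -> bool:
    """True if every required phrase appears in this one chunk."""
    return all(term in chunk for term in required)


doc = (
    "To cancel your subscription open Settings then Billing. "
    "Refunds are issued only if you cancel within 14 days."
)

chunks = chunk_by_words(doc, size=8)

# The refund rule needs both the word "Refunds" and the "14 days" limit.
required = ["Refunds", "14 days"]
full = [c for c in chunks if holds_full_answer(c, required)]

for i, c in enumerate(chunks):
    print(i, repr(c))
print("chunks holding the full refund rule:", len(full))
```

The refund sentence is cut between "within" and "14 days", so retrieving only the chunk that mentions refunds yields an answer without the time limit.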

Demo

Signals to instrument in production: log every retrieval's top-1 similarity score (a low score suggests the corpus may not cover the query), log the similarity gap between top-1 and top-5 (a small gap suggests ambiguity), and log model refusals ('I don't have...') separately from hallucinations. A recurring pattern: most query failures cluster around the same 5-10 corpus gaps. Find them in the logs and write better chunks for those topics; eval scores should improve as a result.

Try it yourself

Add the telemetry above to every RAG call. After a week, look at the distribution of top_sim and outcome.
For each failing eval case, run the categorization. Most cluster into 2-3 root causes. Fix the dominant one first.
Build a 'corpus gaps' weekly review: queries below threshold = topics your corpus doesn't cover well. Each is a doc to write or chunk to add.
Pre-write the system prompt for each failure mode (multi-hop, ambiguous, near-miss). Test the prompt against real failures.
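The weekly review in the steps above can be sketched as a small log summary. This is a minimal sketch under stated assumptions: each log line is taken to be a bare JSON record with the fields the demo logs (event, q, top_sim, gap, n_above_threshold, outcome); the audit function, the sample records, and the cutoffs 0.5 and 0.05 are hypothetical placeholders to tune against your own data.

```python
import json
from collections import Counter
from statistics import median


def audit(log_lines, low_sim=0.5, small_gap=0.05):
    """Summarize rag_query records. low_sim and small_gap are placeholder cutoffs."""
    records = [json.loads(line) for line in log_lines]
    records = [r for r in records if r.get("event") == "rag_query"]
    if not records:
        return {}
    outcomes = Counter(r["outcome"] for r in records)
    n = len(records)
    refused = outcomes["refused_low_similarity"] + outcomes["refused_in_answer"]
    return {
        "n": n,
        "refusal_rate": refused / n,
        "median_top_sim": median(r["top_sim"] for r in records),
        # Candidate corpus gaps: queries whose best hit scored low.
        "low_sim_queries": [r["q"] for r in records if r["top_sim"] < low_sim],
        # Candidate ambiguity: several hits passed, but scores are nearly tied.
        "small_gap_queries": [
            r["q"] for r in records
            if r["n_above_threshold"] > 1 and r["gap"] < small_gap
        ],
    }


# Made-up sample records, shaped like the demo's log output.
sample = [json.dumps(r) for r in [
    {"event": "rag_query", "q": "cancel sub", "top_sim": 0.0, "gap": 0.0,
     "n_above_threshold": 0, "outcome": "refused_low_similarity"},
    {"event": "rag_query", "q": "reset password", "top_sim": 0.82, "gap": 0.30,
     "n_above_threshold": 5, "outcome": "answered"},
    {"event": "rag_query", "q": "refund window", "top_sim": 0.61, "gap": 0.02,
     "n_above_threshold": 4, "outcome": "answered"},
    {"event": "rag_query", "q": "export data", "top_sim": 0.44, "gap": 0.10,
     "n_above_threshold": 2, "outcome": "refused_in_answer"},
]]

report = audit(sample)
print(json.dumps(report, indent=2))
```

If your logging format adds a prefix such as a level or logger name before the JSON, strip it before calling json.loads.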

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Name 4 failure modes of naive RAG. For each, what's the symptom and what technique elsewhere in this course fixes it?

2. Why it works (the mechanism)

Walk me through the math behind 'top-1 similarity score is a signal'. Why does low top-1 correlate with bad answers, and what does a small top-1-to-top-5 gap tell you?

3. Advanced — application & what's next

I have a corpus where 30% of user queries return below-threshold. Help me triage: how many are corpus gaps (need new content), how many are query-vs-doc mismatches (need query rewriting), how many are multi-hop (need decomposition)?

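# Instrument every RAG call.
# search_with_threshold and answer_from_chunks are not defined in this snippet;
# they are assumed to come from earlier lesson code.
import json
import logging

# The root logger defaults to WARNING, so INFO records are dropped unless configured.
logging.basicConfig(level=logging.INFO)

def answer_with_telemetry(question: str) -> str:
    # hits: (chunk, similarity) pairs that passed the threshold, best first
    hits = search_with_threshold(question, k=5)
    # Logged as 0.0 when nothing passes, so the true best score is not recorded here.
    top_sim = hits[0][1] if hits else 0.0
    # Best score minus worst returned score (top-1 vs top-5 when all five pass).
    gap = (hits[0][1] - hits[-1][1]) if len(hits) > 1 else 0.0

    if not hits:
        outcome = "refused_low_similarity"
        ans = "I don't have information about that."
    else:
        ans = answer_from_chunks(question, hits)
        # Crude string match: treats any answer containing "don't have" as a refusal.
        outcome = "refused_in_answer" if "don't have" in ans.lower() else "answered"

    # One JSON record per query keeps the log easy to parse later.
    logging.info(json.dumps({
        "event": "rag_query",
        "q": question,
        "top_sim": float(top_sim),
        "gap": float(gap),
        "n_above_threshold": len(hits),
        "outcome": outcome,
    }))
    return ans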

# audit weekly:
# - what % of queries are below threshold?  (refusal rate)
# - what's the distribution of top-1 sim?   (corpus coverage signal)
# - which user queries cluster at low sim?  (corpus gaps to fill)
# - which queries show small gap?           (ambiguity — multi-hop candidates)

# Categorize a failing eval case by where the expected chunk ranks.
# search and docs_to_id are not defined in this snippet; they are assumed helpers.
def categorize_failure(case):
    hits = search(case["q"], k=20)
    ids = [docs_to_id(h) for h in hits]  # compute the ranked ids once
    expected = case["expected_chunk_id"]
    if expected in ids[:5]:
        return "retrieved_but_answer_wrong"  # prompt or model problem
    if expected in ids[:20]:
        return "near_miss_reranking_would_help"  # ranked 6-20
    return "miss_chunking_or_embedding_problem"  # not in the top 20
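To find the dominant root cause, the categories can be tallied across all failing eval cases. A minimal sketch follows; tally_failures is a hypothetical helper, and the stub categorizer and sample cases exist only so the snippet runs on its own. In practice you would pass categorize_failure from the demo instead.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_failures(
    cases: Iterable[dict],
    categorize: Callable[[dict], str],
) -> list[tuple[str, int]]:
    """Count failure categories, most frequent first."""
    return Counter(categorize(case) for case in cases).most_common()


# Stub for illustration only: reads a pre-assigned label instead of running a search.
def stub_categorize(case: dict) -> str:
    return case["label"]


# Made-up failing cases.
failing_cases = [
    {"q": "cancel sub", "label": "near_miss_reranking_would_help"},
    {"q": "refund window", "label": "near_miss_reranking_would_help"},
    {"q": "export data", "label": "miss_chunking_or_embedding_problem"},
    {"q": "change plan", "label": "near_miss_reranking_would_help"},
    {"q": "invoice address", "label": "retrieved_but_answer_wrong"},
]

ranking = tally_failures(failing_cases, stub_categorize)
for category, count in ranking:
    print(f"{count:3d}  {category}")
```

The first row of the ranking is the category to fix first.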

Run: python3 main.py

The failure modes of naive RAG