Capstok — learn by doing

Why this matters

Naive RAG fetches the top-k chunks unconditionally, even when nothing in your corpus matches the query. The model dutifully answers from irrelevant context and hallucinates a confident-sounding lie. The fix is to enforce a similarity threshold: only inject chunks whose cosine similarity to the query clears a cutoff (e.g. 0.7), and if no chunk passes, answer 'I don't know' or 'I don't have that information'. This single rule turns RAG from 'pretty good when it works' into 'reliable in production'. The hard part is finding the right threshold, because it varies by embedding model and domain.

Demo

How to find your threshold: take your eval set (positives + negatives), compute the cosine similarity of each query-document pair, and plot the two distributions. The right threshold is the value that best separates positives from negatives; where the distributions overlap, any cutoff trades precision against recall. Values around 0.6-0.8 are common for cosine with modern embeddings, but don't pick a number from a tutorial: measure on YOUR data. Variable thresholds (e.g. 'top result must beat 0.75; followups can be 0.65') can work even better.
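
A minimal sketch of the "plot the distribution" step, assuming you already have the similarity scores for your positive and negative pairs as plain lists. The function name `summarize_separation`, the text histogram, and the midpoint rule are illustrative choices, not part of the lesson's code, and the sample scores are made up.

```python
import numpy as np

def summarize_separation(pos_sims, neg_sims, bins=10):
    """Print a text histogram of both groups and suggest a candidate cutoff."""
    pos = np.asarray(pos_sims, dtype=float)
    neg = np.asarray(neg_sims, dtype=float)

    # Assumption: scores lie in [0, 1]; anything outside is simply not drawn.
    edges = np.linspace(0.0, 1.0, bins + 1)
    pos_counts, _ = np.histogram(pos, bins=edges)
    neg_counts, _ = np.histogram(neg, bins=edges)
    for lo, hi, p, n in zip(edges[:-1], edges[1:], pos_counts, neg_counts):
        print(f"{lo:.1f}-{hi:.1f}  pos {'#' * int(p):<12} neg {'#' * int(n)}")

    lowest_pos = float(pos.min())
    highest_neg = float(neg.max())
    # Clean separation means every negative scores below every positive.
    separable = highest_neg < lowest_pos
    # With a clean gap, the midpoint of the gap is one reasonable candidate.
    candidate = (lowest_pos + highest_neg) / 2 if separable else None
    return {
        "lowest_pos": lowest_pos,
        "highest_neg": highest_neg,
        "separable": separable,
        "candidate": candidate,
    }

# Made-up scores, for illustration only.
report = summarize_separation([0.82, 0.78, 0.90], [0.30, 0.45, 0.60])
print(report)
```

If `separable` comes back `False`, the two groups overlap and no single cutoff is free of errors; in that case a precision/recall sweep over candidate thresholds is the more useful tool.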

Try it yourself

Add a similarity threshold to your naive RAG. Ask it a question your corpus can't answer. Confirm it returns 'I don't know', not a hallucination.
Tune the threshold on your eval set: plot positive vs negative similarities, pick the value with the cleanest separation.
Try tiered thresholds (stricter at rank 1, looser later). This often improves recall on multi-fact questions without hurting precision.
Add the score to your logs: every query logs its top-1 similarity. Now you can audit 'what fraction of queries are below threshold?'. Those queries are refused outright, so that fraction sets your recall ceiling.
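
The logging exercise could look roughly like this. It is a sketch under assumptions: the logger name `rag.retrieval`, the record fields, and the helpers `log_top1` and `fraction_below` are all hypothetical names chosen for illustration, and `sims` stands for whatever list or array of similarity scores your retriever produced for the query.

```python
import json
import logging

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name

def log_top1(query, sims, min_sim=0.70):
    """Log the best similarity for one query and flag it if below threshold."""
    top1 = float(max(sims)) if len(sims) else None
    record = {
        "query": query,
        "top1_sim": top1,
        # An empty result set counts as below threshold.
        "below_threshold": top1 is None or top1 < min_sim,
    }
    logger.info(json.dumps(record))
    return record

def fraction_below(records):
    """Share of logged queries whose best match missed the threshold."""
    if not records:
        return 0.0
    return sum(r["below_threshold"] for r in records) / len(records)

# Made-up scores, for illustration only.
records = [
    log_top1("refund policy", [0.91, 0.40]),
    log_top1("capital of France", [0.35]),
    log_top1("empty index case", []),
]
print(f"below threshold: {fraction_below(records):.0%}")
```

Keeping the raw `top1_sim` in the record, not just the flag, means you can re-run the audit later against a different threshold without re-embedding anything.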

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Why does plain top-k retrieval cause hallucinations? What's the fix and what's the cost?

2. Why it works (the mechanism)

Walk me through how to derive a similarity threshold from labeled data. Why is the precision/recall trade-off hard to escape?

3. Advanced — application & what's next

I have a domain where some queries have exact answers in the corpus and others require synthesis from multiple chunks. Should I use one threshold, two thresholds, or a learned classifier? Why?

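import numpy as np

# embed(), DOCS and DOC_VECS are assumed to be defined elsewhere:
# embed(list_of_str) returns one vector per string, and DOC_VECS holds
# the embedding of each entry in DOCS, row for row.

def search_with_threshold(query, k=5, min_sim=0.70):
    qv = embed([query])[0]
    # cosine similarity between the query and every document
    sims = (DOC_VECS @ qv) / (
        np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv)
    )
    # take the top-k by similarity, then drop anything below the threshold
    idxs = sims.argsort()[::-1][:k]
    return [(DOCS[i], float(sims[i])) for i in idxs if sims[i] >= min_sim]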

def answer(question: str) -> str:
    hits = search_with_threshold(question)
    if not hits:
        return "I don't have information about that."
    context = "\n\n---\n\n".join(text for text, _ in hits)
    # ... call LLM with context

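# Tuning the threshold from your eval set
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_threshold(pos_pairs, neg_pairs):
    pos_sims = [cosine(embed([q])[0], embed([d])[0]) for q, d in pos_pairs]
    neg_sims = [cosine(embed([q])[0], embed([d])[0]) for q, d in neg_pairs]
    # pick the threshold that maximizes (precision + recall) on the eval set
    best_t, best_score = None, -1.0
    for t in np.arange(0.5, 0.9, 0.02):
        tp = sum(1 for s in pos_sims if s >= t)
        fp = sum(1 for s in neg_sims if s >= t)
        fn = sum(1 for s in pos_sims if s < t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"t={t:.2f}  precision={precision:.2f}  recall={recall:.2f}")
        if precision + recall > best_score:
            best_t, best_score = float(t), precision + recall
    return best_t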

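# Variable thresholds: stricter at #1 than at #5
def search_tiered(query):
    qv = embed([query])[0]
    # assumed to be the same cosine computation as in search_with_threshold
    sims = (DOC_VECS @ qv) / (
        np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv)
    )
    idxs = sims.argsort()[::-1]
    out = []
    for rank, i in enumerate(idxs[:10]):
        # ranks 6-10 reuse the last (loosest) threshold
        threshold = [0.75, 0.70, 0.68, 0.65, 0.65, 0.60][min(rank, 5)]
        if sims[i] >= threshold:
            out.append((DOCS[i], float(sims[i])))
        elif rank == 0:
            break  # the top result must clear the strictest gate, else return nothing
    return out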
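
To try the refusal path without a real embedding model, here is a self-contained toy harness. Everything in it is an assumption made for illustration: the two sample documents, the bag-of-words `embed` stand-in, and returning the top hit in place of the LLM call. Bag-of-words scores behave differently from real embeddings, so the 0.70 cutoff here only demonstrates the control flow, not a sensible production value.

```python
import numpy as np

# Made-up corpus, for illustration only.
DOCS = [
    "The refund window is 30 days from delivery.",
    "Support is available by email on weekdays.",
]

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

VOCAB = sorted({w for d in DOCS for w in tokenize(d)})
INDEX = {w: i for i, w in enumerate(VOCAB)}

def embed(texts):
    """Toy bag-of-words stand-in for a real embedding model (assumption)."""
    vecs = np.zeros((len(texts), len(VOCAB)))
    for row, text in enumerate(texts):
        for w in tokenize(text):
            if w in INDEX:
                vecs[row, INDEX[w]] += 1.0
    return vecs

DOC_VECS = embed(DOCS)

def search_with_threshold(query, k=5, min_sim=0.70):
    qv = embed([query])[0]
    denom = np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv)
    # A query with no known words embeds to the zero vector; score it as 0.
    sims = np.divide(DOC_VECS @ qv, denom, out=np.zeros(len(DOCS)), where=denom > 0)
    idxs = sims.argsort()[::-1][:k]
    return [(DOCS[i], float(sims[i])) for i in idxs if sims[i] >= min_sim]

def answer(question):
    hits = search_with_threshold(question)
    if not hits:
        return "I don't have information about that."
    return hits[0][0]  # stand-in for the LLM call: echo the best chunk

print(answer("What is the capital of France?"))
print(answer("refund window is 30 days from delivery"))
```

The first question shares only stop words with the corpus, so its best score stays under the cutoff and the function refuses; the second overlaps heavily with the refund document and passes.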

Run: python3 main.py

Top-k, threshold, and 'I don't know'