Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Naive RAG fetches the top-k chunks unconditionally, even when nothing in your corpus matches the query. The model dutifully answers from irrelevant context and hallucinates a confident-sounding lie. The fix is to enforce a similarity threshold: only inject chunks whose cosine similarity to the query clears a cutoff (e.g. 0.7); if no chunk passes, the answer is 'I don't know' or 'I don't have that information'. This single rule moves RAG from 'pretty good when it works' toward 'reliable in production'. The cost is that a threshold set too high also rejects chunks that would have answered the question. The hard part is finding the right value, because it varies by embedding model and domain.
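To make the failure mode concrete, here is a minimal sketch that contrasts plain top-k with thresholded retrieval. Everything in it is an assumption for illustration: the two-sentence corpus, the bag-of-words `embed` (a toy stand-in for a real embedding model), and the `min_sim=0.3` cutoff, which only makes sense for this toy embedding and says nothing about the right value for a real model.

```python
import math
import re
from collections import Counter

# Hypothetical two-document corpus, for illustration only.
DOCS = [
    "The refund window is 30 days from delivery.",
    "Support is available by email on weekdays.",
]

def embed(text):
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, k=1):
    # Naive retrieval: always returns k chunks, however weak the match.
    qv = embed(query)
    scored = sorted(((cosine(qv, embed(d)), d) for d in DOCS), reverse=True)
    return scored[:k]

def top_k_with_threshold(query, k=1, min_sim=0.3):
    # Same ranking, but chunks below min_sim are dropped.
    return [(s, d) for s, d in top_k(query, k) if s >= min_sim]

off_topic = "How tall is Mount Everest?"
on_topic = "What is the refund window?"

print(top_k(off_topic))                 # still returns a chunk, with a low score
print(top_k_with_threshold(off_topic))  # empty list: the caller should refuse
print(top_k_with_threshold(on_topic))   # the refund chunk clears the cutoff
```

The off-topic query shares only the word "is" with the corpus, yet plain top-k still hands a chunk to the model. The thresholded version returns nothing, which is the signal to answer 'I don't know'.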
How to find your threshold: take your eval set (positive and negative query-document pairs), compute the cosine similarity of each pair, and plot the two distributions. The right threshold is the value that best separates positives from negatives; the distributions usually overlap, so any choice trades precision against recall. Values of 0.6-0.8 are common for cosine similarity with modern embeddings, but don't pick one from a tutorial: measure on YOUR data. Variable thresholds (e.g. 'top result must beat 0.75; follow-ups can be 0.65') can work even better.
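The "plot the distribution" step can be sketched without a plotting library. The scores below are made up purely to show the mechanics, and the helper names `histogram` and `overlap` are hypothetical; in practice the two lists would come from embedding your own labeled pairs.

```python
def histogram(scores, bins=10, lo=0.0, hi=1.0):
    # Count how many scores fall into each equal-width bin between lo and hi.
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        idx = int((s - lo) / width)
        counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range scores
    return counts

def overlap(pos_sims, neg_sims):
    # Describe the zone where positives and negatives cannot be separated.
    lowest_pos, highest_neg = min(pos_sims), max(neg_sims)
    neg_in_zone = sum(1 for s in neg_sims if s >= lowest_pos)
    pos_in_zone = sum(1 for s in pos_sims if s <= highest_neg)
    return lowest_pos, highest_neg, pos_in_zone, neg_in_zone

# Made-up similarity scores, for illustration only.
pos_sims = [0.88, 0.82, 0.78, 0.74, 0.71, 0.66]
neg_sims = [0.69, 0.63, 0.58, 0.52, 0.41, 0.35]

for label, scores in (("pos", pos_sims), ("neg", neg_sims)):
    for b, count in enumerate(histogram(scores)):
        print(f"{label} {b / 10:.1f}-{(b + 1) / 10:.1f} {'#' * count}")

print(overlap(pos_sims, neg_sims))
```

In this toy data the lowest positive (0.66) sits below the highest negative (0.69), so no single cutoff classifies every pair correctly. Where you place the threshold inside or around that zone decides whether you give up precision or recall.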
Use these three in order. Each builds on the one before.
Why does plain top-k retrieval cause hallucinations? What's the fix and what's the cost?
Walk me through how to derive a similarity threshold from labeled data. Why is the precision/recall trade-off hard to escape?
I have a domain where some queries have exact answers in the corpus and others require synthesis from multiple chunks. Should I use one threshold, two thresholds, or a learned classifier? Why?
import numpy as np

def search_with_threshold(query, k=5, min_sim=0.70):
    qv = embed([query])[0]
    sims = (DOC_VECS @ qv) / (
        np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv)
    )
    idxs = sims.argsort()[::-1][:k]
    return [(DOCS[i], float(sims[i])) for i in idxs if sims[i] >= min_sim]

def answer(question: str) -> str:
    hits = search_with_threshold(question)
    if not hits:
        return "I don't have information about that."
    context = "\n\n---\n\n".join(text for text, _ in hits)
    # ... call LLM with context
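
# Tuning the threshold from your eval set
def cosine(a, b):
    # cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_threshold(pos_pairs, neg_pairs):
    pos_sims = [cosine(embed([q])[0], embed([d])[0]) for q, d in pos_pairs]
    neg_sims = [cosine(embed([q])[0], embed([d])[0]) for q, d in neg_pairs]
    # pick the threshold that maximizes (precision + recall) on the eval set
    best_t, best_score = None, -1.0
    for t in np.arange(0.5, 0.9, 0.02):
        tp = sum(1 for s in pos_sims if s >= t)
        fp = sum(1 for s in neg_sims if s >= t)
        fn = sum(1 for s in pos_sims if s < t)
        precision = tp / (tp + fp + 1e-6)
        recall = tp / (tp + fn + 1e-6)
        print(f"t={t:.2f} precision={precision:.2f} recall={recall:.2f}")
        if precision + recall > best_score:
            best_t, best_score = float(t), precision + recall
    return best_t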

# Variable thresholds: stricter at #1 than at #5
def search_tiered(query):
    qv = embed([query])[0]
    sims = ...
    idxs = sims.argsort()[::-1]
    out = []
    for rank, i in enumerate(idxs[:10]):
        threshold = [0.75, 0.70, 0.68, 0.65, 0.65, 0.60][min(rank, 5)]
        if sims[i] >= threshold:
            out.append((DOCS[i], float(sims[i])))
    return out

python3 main.py
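
The per-rank schedule above can be tried on plain numbers, without an embedding model. The sketch below reuses the same six tier values; the score lists are made up, and `apply_tiers` with its `require_top` flag is a hypothetical helper, not part of the lesson's code.

```python
# Same schedule as the tiered search: stricter at rank 0, looser further down.
TIERS = [0.75, 0.70, 0.68, 0.65, 0.65, 0.60]

def apply_tiers(sorted_sims, tiers=TIERS, require_top=False):
    """Return the ranks whose score clears the threshold for that rank.

    sorted_sims must already be in descending order.
    require_top is a hypothetical option: if rank 0 fails, keep nothing.
    """
    kept = []
    for rank, sim in enumerate(sorted_sims):
        threshold = tiers[min(rank, len(tiers) - 1)]
        if sim >= threshold:
            kept.append(rank)
        elif rank == 0 and require_top:
            return []
    return kept

# Made-up, already-sorted similarity scores.
strong = [0.78, 0.71, 0.67, 0.66, 0.61]
weak_top = [0.72, 0.71, 0.66]

print(apply_tiers(strong))
print(apply_tiers(weak_top))
print(apply_tiers(weak_top, require_top=True))
```

One thing this makes visible: because the thresholds loosen with rank, a lower-ranked chunk can pass while the top one fails (the `weak_top` case). If you read 'top result must beat 0.75' as a hard gate, an option like `require_top` is one way to express it; whether that is the right behavior depends on your data.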