A bad RAG system prompt invites hallucination even when retrieval is perfect. A good one steers the model toward three behaviours: (1) answer only from the context, (2) cite which chunks were used, and (3) say 'I don't know' when the context is insufficient. Five lines of careful instruction can reduce hallucination rates by 30-50% on common eval sets. Most teams skip this step and blame the model when the actual fix is the prompt.
The components that matter: an explicit anti-hallucination rule ('Answer ONLY from the provided context'), a refusal instruction ('If the context does not contain the answer, say I don't have that information'), a citation instruction ('Cite the chunk number for each claim'), and the context formatting (give each chunk a clear delimiter and an ID). Bonus: instruct the model to quote verbatim for factual claims, because quotes are easier to verify than paraphrases.
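The point about verbatim quotes being easier to verify can be made concrete with a small post-check. The sketch below is illustrative only: `normalize` and `unsupported_quotes` are hypothetical helper names, not part of the lesson's demo, and it assumes the model wraps quoted material in plain double quotes.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not cause false misses."""
    return " ".join(text.split()).lower()


def unsupported_quotes(answer: str, chunks: list[str]) -> list[str]:
    """Return every double-quoted span in `answer` that appears in no chunk."""
    haystack = [normalize(c) for c in chunks]
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if not any(normalize(q) in h for h in haystack)]


# Hypothetical retrieved chunks and two candidate answers.
chunks = [
    "Refunds are issued within 14 days of the return being received.",
    "Gift cards cannot be refunded or exchanged for cash.",
]
good = 'The policy says "refunds are issued within 14 days" [#1].'
bad = 'The policy says "refunds are issued within 30 days" [#1].'

print(unsupported_quotes(good, chunks))  # []
print(unsupported_quotes(bad, chunks))   # ['refunds are issued within 30 days']
```

A paraphrase would need a semantic comparison to check, whereas a quote only needs a substring test like this one. That is the practical reason to ask for verbatim quotes.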
Use these three prompts in order. Each builds on the one before.
What's the difference between a vague RAG system prompt and a careful one? Give an example of each.
Walk me through *why* asking the model to cite chunks reduces hallucination — what's the mechanism inside the LLM that makes citations harder to fabricate?
Design a prompt that handles partial-answer cases gracefully: 'I know A from the context, but B is not specified. Possible interpretations: ...'. What's the structure?
SYSTEM_RAG = """You are a careful assistant that answers questions using ONLY the provided context. Follow these rules strictly:
1. If the answer is in the context, give it concisely and cite the chunk number(s) like [#3].
2. If the answer is partially in the context, give what's there and clearly note what is missing.
3. If the answer is NOT in the context, say exactly: "I don't have information about that in the provided documents."
4. Never use general knowledge to fill gaps. Never speculate.
5. Quote verbatim from the context when making factual claims.
Context (numbered chunks):
{context}
"""
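
def format_context(chunks: list[tuple[str, float]]) -> str:
    # Give every chunk a 1-based ID and a clear delimiter so the model can cite it.
    return "\n\n".join(
        f"[#{i+1}] (similarity {sim:.2f})\n{text}"
        for i, (text, sim) in enumerate(chunks)
    )


# search_with_threshold and llm are not defined in this snippet; llm is
# presumably an anthropic.Anthropic() client created elsewhere in the lesson.
def answer(question: str) -> str:
    hits = search_with_threshold(question, k=5, min_sim=0.70)
    if not hits:
        # Nothing relevant was retrieved: refuse without calling the model.
        return "I don't have information about that in the provided documents."
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM_RAG.format(context=format_context(hits)),
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text
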
# example output (with this prompt + good retrieval):
# "Postgres MVCC creates multiple row versions when concurrent transactions
# modify the same row [#1]. A long-running transaction keeps all those versions
# alive because they might still be visible to it [#2], which prevents
# vacuum from reclaiming dead tuples [#2]."

python3 main.py
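Because the prompt asks for citations in the form [#n], the model's answer can be checked mechanically before it is shown to a user. The following is a minimal sketch of such a check, assuming the citation format and refusal sentence from SYSTEM_RAG above; `cited_ids` and `check_citations` are hypothetical names introduced here for illustration.

```python
import re

# The exact refusal sentence that rule 3 of the system prompt asks for.
REFUSAL = "I don't have information about that in the provided documents."


def cited_ids(answer: str) -> set[int]:
    """Collect every chunk number cited as [#n] in the answer."""
    return {int(n) for n in re.findall(r"\[#(\d+)\]", answer)}


def check_citations(answer: str, n_chunks: int) -> dict:
    """Flag citations that point at chunks that were never supplied."""
    ids = cited_ids(answer)
    refused = answer.strip() == REFUSAL
    return {
        "refused": refused,
        "cited": sorted(ids),
        "out_of_range": sorted(i for i in ids if not 1 <= i <= n_chunks),
        "uncited": not ids and not refused,
    }


example = (
    "Postgres MVCC creates multiple row versions when concurrent transactions "
    "modify the same row [#1]. A long-running transaction keeps all those "
    "versions alive [#2], which prevents vacuum from reclaiming dead tuples [#2]."
)
print(check_citations(example, n_chunks=2))
print(check_citations("MVCC uses snapshots [#7].", n_chunks=5)["out_of_range"])  # [7]
```

This only confirms that each cited ID exists. It does not confirm that the cited chunk actually supports the claim, and it does not handle grouped forms such as [#1, #2].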