Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
RAG (retrieval-augmented generation) is a pattern, not a product. A model on its own hallucinates because it knows only what was in its training data; RAG injects fresh, ground-truth context into the prompt at inference time. The whole game is: pick the right chunks of your documents, paste them into the system message, and let the model answer with them in scope. Everything else (better retrievers, smarter chunking, reranking, agents) is just iterating on 'pick the right chunks'. If you don't have the basic loop working, the fancy techniques won't save you.
Naive RAG, end to end, in 30 lines: split your docs into chunks of ~500 tokens, embed each chunk with a model like text-embedding-3-small, and store the embeddings and text in any vector DB (or a flat numpy array if you have <10k chunks). At query time, embed the user's question, find the top-5 nearest chunks by cosine similarity, paste them into the system message, and call the LLM with the question. This is the version that ships in most demos, and it is the floor we'll improve from.
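The splitting step above can be sketched as follows. This is a minimal illustration, not part of the lesson's demo: the helper name `chunk_text` and its `overlap` parameter are hypothetical, and whitespace-separated words stand in for tokens, so chunk sizes are only approximate.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 0) -> list[str]:
    """Split text into windows of roughly `size` words.

    Assumption: words are a rough stand-in for tokens. Swap in a real
    tokenizer if you need exact token counts.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(1200))
    print([len(c.split()) for c in chunk_text(doc)])  # [500, 500, 200]
```

A small overlap (say 50 words) is a common tweak so that a sentence cut at a boundary still appears whole in one chunk; it is optional and defaults to off here.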
Use these three prompts in order. Each builds on the one before.
In one paragraph, explain RAG: what's retrieved, what's augmented, what's generated. Why does it reduce hallucinations vs vanilla LLM calls?
Walk me through what 'cosine similarity over embeddings' actually does — what's an embedding, what does the cosine measure semantically, and why is it 'find similar text' instead of exact match?
I have 50,000 docs of 5KB each. Walk me through what happens at index time (cost, time, storage) and at query time (latency budget, where it spends). What scales linearly with N and what doesn't?
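For the sizing prompt, a back-of-envelope calculation helps frame the discussion. The sketch below rests on loudly stated assumptions: about 4 characters per token (a rough rule of thumb for English text), 500-token chunks, 5 KB taken as 5,000 bytes, and 1536-dimensional float32 vectors. The function name `index_estimate` is hypothetical, and pricing and wall-clock time are left out because they depend on the provider.

```python
import math


def index_estimate(
    n_docs: int,
    doc_bytes: int,
    chars_per_token: float = 4.0,  # assumption: rough average for English
    chunk_tokens: int = 500,
    dims: int = 1536,              # vector size used in this lesson's demo
    bytes_per_float: int = 4,      # float32
) -> dict[str, int]:
    """Rough index-size numbers for a flat (brute-force) vector store."""
    tokens_per_doc = doc_bytes / chars_per_token
    chunks_per_doc = math.ceil(tokens_per_doc / chunk_tokens)
    n_chunks = n_docs * chunks_per_doc
    return {
        "total_tokens": int(n_docs * tokens_per_doc),  # tokens sent to the embedder
        "n_chunks": n_chunks,
        "vector_bytes": n_chunks * dims * bytes_per_float,
        # one multiply and one add per dimension per chunk
        "flops_per_query": 2 * n_chunks * dims,
    }


if __name__ == "__main__":
    for name, value in index_estimate(50_000, 5_000).items():
        print(f"{name}: {value:,}")
```

Under these assumptions the corpus is about 62.5M tokens and 150,000 chunks, with roughly 0.92 GB of raw vectors. Embedding cost, storage, and the brute-force scan all grow linearly with the number of chunks; embedding the query and the LLM call do not depend on corpus size.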
import numpy as np
from anthropic import Anthropic
from openai import OpenAI

# Both clients read their API keys from the environment
# (ANTHROPIC_API_KEY and OPENAI_API_KEY).
llm = Anthropic()
emb_client = OpenAI()  # embeddings via OpenAI; LLM via Claude

DOCS = [
    "Postgres MVCC keeps multiple row versions visible to in-flight transactions...",
    "A long transaction prevents vacuum from reclaiming dead tuples...",
    "Connection pools should be sized as (2 * cpus) + 1 per app instance...",
    # ... a few hundred more in practice
]

def embed(texts: list[str]) -> np.ndarray:
    r = emb_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data])

# index once
DOC_VECS = embed(DOCS)  # shape: (N, 1536)
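
def search(query: str, k: int = 5) -> list[str]:
    qv = embed([query])[0]
    # cosine similarity: dot product divided by the product of the vector norms
    sims = DOC_VECS @ qv / (np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv))
    top = sims.argsort()[::-1][:k]  # indices of the k highest scores, best first
    return [DOCS[i] for i in top]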

def answer(question: str) -> str:
    context = "\n\n---\n\n".join(search(question))
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=f"Answer ONLY from this context. If absent, say you don't know.\n\nContext:\n{context}",
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text

print(answer("Why does a long transaction cause table bloat?"))

Run it with:
python3 main.py
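To check the ranking math without API keys, the same cosine formula used in search can be run on toy vectors. This is a sketch for illustration only: the helper name `cosine_top_k` and the 2-D vectors are made up, standing in for real 1536-dimensional embeddings.

```python
import numpy as np


def cosine_top_k(doc_vecs: np.ndarray, qv: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows of doc_vecs most similar to qv, best first."""
    sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
    return sims.argsort()[::-1][:k]


# Hypothetical 2-D "embeddings": direction carries the meaning, length does not.
toy = np.array([
    [1.0, 0.0],    # same direction as the query -> similarity 1.0
    [0.0, 1.0],    # orthogonal -> similarity 0.0
    [10.0, 1.0],   # long vector, slightly off-direction -> about 0.995
    [-1.0, 0.0],   # opposite direction -> similarity -1.0
])
query = np.array([2.0, 0.0])

print(cosine_top_k(toy, query, k=2))  # [0 2]
```

The long vector at index 2 ranks below the short one at index 0 because cosine similarity compares angles, not magnitudes. That is why it behaves as "find text pointing the same way in embedding space" instead of exact string matching.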