Capstok — learn by doing

Why this matters

RAG (retrieval-augmented generation) is a pattern, not a product. A model on its own hallucinates because it only knows what was in its training data; RAG injects fresh, ground-truth context into the prompt at inference time. The whole game is: pick the right chunks of your documents, paste them into the system message, and let the model answer with them in scope. Everything else (better retrievers, smarter chunking, reranking, agents) is just iterating on 'pick the right chunks'. If you don't have the basic loop working, the fancy techniques won't save you.

Demo

Naive RAG, end to end, in about 30 lines: split your docs into chunks of ~500 tokens, embed each chunk with a model like text-embedding-3-small, and store embeddings + text in any vector DB (or a flat numpy array if you have <10k chunks). At query time, embed the user's question, find the top-5 nearest chunks by cosine similarity, paste them into the system message next to the answering instructions, and call the LLM with the question as the user message. This is the version that ships in most demos and serves as the floor we'll improve from.
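
The chunking step is the one piece of the loop the demo code skips, since its DOCS list is already chunk-sized. A minimal sketch of it, assuming whitespace-separated words are a good enough stand-in for tokens (a real tokenizer will count differently); chunk_text is a hypothetical helper name, not part of any library:

```python
def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words.

    Assumption: words approximate tokens. Swap in a real tokenizer
    if you need the chunk size to match the model's token count.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# A 1200-word document becomes chunks of 500, 500 and 200 words.
print(len(chunk_text("word " * 1200)))  # → 3
```

Fixed-size splitting ignores sentence and section boundaries, which is acceptable for the naive floor and is one of the things "smarter chunking" later improves.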

Try it yourself

Stand up the naive loop above against any 20-50 documents you have lying around (a wiki dump, a manual, a markdown folder). Confirm it answers questions from them.
Ask it a question whose answer isn't in your docs. It should say 'I don't know' — if it hallucinates, your system prompt is too permissive.
Vary k (the number of chunks injected). At k=1 it's brittle; at k=20 the context fills with noise and answers degrade. Find your knee.
Time a query end to end. Embedding the question is usually 50-150ms; LLM call is 1-3 seconds. Most latency is in the LLM, not retrieval.
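
For the timing exercise, one way to see where the time goes is to wrap each stage separately. This is a sketch only: time_stages and the three stage callables are hypothetical names, and the stand-in lambdas below should be replaced with your own embedding, retrieval and LLM calls.

```python
import time
from typing import Any, Callable


def time_stages(
    question: str,
    embed_fn: Callable[[str], Any],
    retrieve_fn: Callable[[Any], list[str]],
    generate_fn: Callable[[str, list[str]], str],
) -> tuple[str, dict[str, float]]:
    """Run one query and return (answer, seconds spent per stage)."""
    timings: dict[str, float] = {}

    start = time.perf_counter()
    query_vec = embed_fn(question)          # embed the question
    timings["embed"] = time.perf_counter() - start

    start = time.perf_counter()
    chunks = retrieve_fn(query_vec)         # nearest-chunk lookup
    timings["retrieve"] = time.perf_counter() - start

    start = time.perf_counter()
    reply = generate_fn(question, chunks)   # LLM call
    timings["generate"] = time.perf_counter() - start

    timings["total"] = sum(timings.values())
    return reply, timings


# Stand-in stages so the sketch runs without API keys.
reply, timings = time_stages(
    "example question",
    embed_fn=lambda q: [0.0],
    retrieve_fn=lambda vec: ["stub chunk"],
    generate_fn=lambda q, chunks: "stub answer",
)
for stage, seconds in timings.items():
    print(f"{stage:>8}: {seconds * 1000:.2f} ms")
```

Run it a few times and compare medians rather than single runs, since network calls vary a lot from one request to the next.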

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, explain RAG: what's retrieved, what's augmented, what's generated. Why does it reduce hallucinations vs vanilla LLM calls?

2. Why it works (the mechanism)

Walk me through what 'cosine similarity over embeddings' actually does — what's an embedding, what does the cosine measure semantically, and why is it 'find similar text' instead of exact match?

3. Advanced — application & what's next

I have 50,000 docs of 5KB each. Walk me through what happens at index time (cost, time, storage) and at query time (latency budget, where it spends). What scales linearly with N and what doesn't?

import numpy as np
from openai import OpenAI
from anthropic import Anthropic

# Both clients read their API keys from the environment
# (ANTHROPIC_API_KEY and OPENAI_API_KEY).
llm = Anthropic()
emb_client = OpenAI()  # embeddings via OpenAI; LLM via Claude

DOCS = [
    "Postgres MVCC keeps multiple row versions visible to in-flight transactions...",
    "A long transaction prevents vacuum from reclaiming dead tuples...",
    "Connection pools should be sized as (2 * cpus) + 1 per app instance...",
    # ... a few hundred more in practice
]

def embed(texts: list[str]) -> np.ndarray:
    r = emb_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data])

# index once
DOC_VECS = embed(DOCS)              # shape: (N, 1536)

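def search(query: str, k: int = 5) -> list[str]:
    qv = embed([query])[0]
    # cosine similarity: dot product divided by the product of the vector norms
    sims = DOC_VECS @ qv / (np.linalg.norm(DOC_VECS, axis=1) * np.linalg.norm(qv))
    # indices of the k highest similarities, best match first
    top = sims.argsort()[::-1][:k]
    return [DOCS[i] for i in top]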

def answer(question: str) -> str:
    context = "\n\n---\n\n".join(search(question))
    msg = llm.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=f"Answer ONLY from this context. If absent, say you don't know.\n\nContext:\n{context}",
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text

print(answer("Why does a long transaction cause table bloat?"))

Run: python3 main.py
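
The ranking logic inside search can be checked offline, with no API keys, by feeding it hand-made vectors. A small sketch under that assumption; top_k_cosine is a hypothetical helper that mirrors the similarity line in the code above, and the 2-D vectors are toy stand-ins for real embeddings:

```python
import numpy as np


def top_k_cosine(doc_vecs: np.ndarray, query_vec: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k rows of doc_vecs most similar to query_vec."""
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    sims = doc_vecs @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    return [int(i) for i in np.argsort(sims)[::-1][:k]]


# Toy "embeddings": cosine compares direction, not length.
docs = np.array([
    [1.0, 0.0],    # points almost the same way as the query
    [10.0, 10.0],  # long vector, but 45 degrees away
    [0.0, 3.0],    # nearly orthogonal to the query
    [-1.0, 0.0],   # points the opposite way
])
query = np.array([2.0, 0.1])

print(top_k_cosine(docs, query, k=2))  # → [0, 1]
```

The long vector at index 1 does not outrank the short one at index 0, because dividing by the norms removes magnitude and leaves only the angle between the vectors.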

What RAG actually is