Project · Ship naive RAG end-to-end

project

hard

module project

Ship something real. Submit your work when you're done.

Brief

Build a naive RAG system against any corpus you have (or 50+ Wikipedia articles). Include: chunking, embedding, vector store (numpy or pgvector), top-k with threshold, a careful system prompt, and a 30-item eval set with retrieval + answer-quality metrics. Submit the code, eval results, and a 1-page writeup of the 3 main failure modes you observed.

Deliverables

Repo with index.py / index.ts (chunk + embed + store) and query.py / query.ts (search + LLM call).
30-item eval set with (question, expected fact, expected chunk id) triples.
Eval output: retrieval@5 and answer accuracy numbers.
Writeup: 3 failure modes observed + which later technique might fix each.

How we grade it

Naive RAG actually works end-to-end and answers questions from your corpus.
Eval set is real (questions from real users or a domain expert, not generated by an LLM).
Similarity threshold is in place; 'I don't know' is returned when retrieval is weak.
Failure modes are named and categorized (not 'sometimes wrong').

Project · Ship naive RAG end-to-end

Hints

Expected output

Stretch goals