Build a naive RAG system against any corpus you have (or 50+ Wikipedia articles). Include: chunking, embedding, vector store (numpy or pgvector), top-k with threshold, a careful system prompt, and a 30-item eval set with retrieval + answer-quality metrics. Submit the code, eval results, and a 1-page writeup of the 3 main failure modes you observed.
# Naive RAG results
- Corpus: 412 docs (internal engineering wiki)
- Chunks: 1,847 (~500 tokens, 100 overlap)
- Embedding: text-embedding-3-small (1536 dim)
- Store: pgvector + HNSW
# Eval (30 cases)
- Retrieval@5: 73%
- Answer accuracy (LLM-judge): 67%
- Refusal rate: 12%
# Top failure modes
1. Chunk boundary splits answer (8/30 cases) -> Module 2 fixes
2. Query vocabulary mismatch (5/30) -> Module 5 query rewriting fixes
3. Multi-hop required (3/30) -> Module 6 fixes