Capstok — learn by doing

Why this matters

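Dense retrieval alone misses exact keyword matches (product names, error codes, identifiers). BM25 alone misses paraphrases. Hybrid covers both: run the two retrievers in parallel and merge their results with Reciprocal Rank Fusion (RRF). For most corpora, hybrid lifts retrieval@5 by 5-15% over dense alone. It costs almost nothing, because keyword search over an indexed Postgres tsvector column is fast. Note that Postgres's built-in ranking functions (ts_rank, ts_rank_cd) are not true BM25; this lesson uses them as a BM25-style lexical ranker.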

Demo

Implementation: pgvector handles dense retrieval, tsvector + GIN handles the keyword (BM25-style) side, and RRF fuses the two in one CTE. Cap each retriever at 50 candidates, fuse to the top 50 unique ids, and pass those to the reranker (next task). The whole thing is either one SQL query in Postgres or two parallel queries in your app followed by RRF in Python.
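
For reference, the fusion step can be written as a single formula. This is a sketch of the standard RRF score, where R is the set of rankings being fused (here dense and keyword), rank_r(d) is the 1-based position of chunk d in ranking r, and a chunk missing from a ranking contributes nothing for it. Chunks A and B in the worked numbers are made up for illustration.

```latex
\[
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60
\]
% Chunk A: ranked 1st by dense and 3rd by keyword search.
\[
\mathrm{RRF}(A) = \frac{1}{61} + \frac{1}{63} \approx 0.0323
\]
% Chunk B: ranked 2nd by dense, absent from the keyword results.
\[
\mathrm{RRF}(B) = \frac{1}{62} \approx 0.0161
\]
```

Only rank positions enter the sum, so cosine distances and ts_rank_cd scores never have to be put on a common scale. A chunk that both retrievers return tends to outrank one that only a single retriever found, even if that one sits slightly higher in its own list.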

Try it yourself

Add BM25 to your retrieval. Run hybrid on the same 20-50 eval queries. Compare retrieval@5 vs dense-only.
Find 5 queries where BM25 wins (exact tokens) and 5 where dense wins (paraphrase). The gap is real.
Time the parallel hybrid query. It should add less than 50 ms over dense-only.
Tune the RRF k constant. 60 is the paper default; values from 10 to 100 work fine.
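
The comparisons above can be sketched with a small harness. It assumes "retrieval@5" means hit rate at 5 (at least one relevant chunk in the top 5), that your eval set is a dict mapping each query to a set of relevant chunk ids, and that each retriever is a plain function from a query string to a ranked list of ids. All names here (hit_rate_at_k, wins, median_latency_ms) are illustrative, not part of any library.

```python
import time
from statistics import median
from typing import Callable, Dict, List, Set

# A retriever maps a query to ranked chunk ids, best first (assumed shape).
Retriever = Callable[[str], List[str]]


def hit_rate_at_k(retrieve: Retriever, labeled: Dict[str, Set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant id in the top k."""
    if not labeled:
        return 0.0
    hits = sum(
        1
        for query, relevant in labeled.items()
        if relevant.intersection(retrieve(query)[:k])
    )
    return hits / len(labeled)


def wins(a: Retriever, b: Retriever, labeled: Dict[str, Set[str]], k: int = 5) -> List[str]:
    """Queries where retriever a hits in the top k and retriever b does not."""
    out = []
    for query, relevant in labeled.items():
        a_hit = bool(relevant.intersection(a(query)[:k]))
        b_hit = bool(relevant.intersection(b(query)[:k]))
        if a_hit and not b_hit:
            out.append(query)
    return out


def median_latency_ms(retrieve: Retriever, queries: List[str], repeats: int = 3) -> float:
    """Median wall-clock time per call in milliseconds; queries must be non-empty."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            retrieve(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)
```

Because hybrid_search in the listing at the end of this lesson is async, one option is to wrap it before passing it in, for example lambda q: asyncio.run(hybrid_search(q)). Calling wins(keyword, dense, labeled) and wins(dense, keyword, labeled) gives the two lists of queries the second task asks for.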

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Why does hybrid search beat either retriever alone? Give a query type for each.

2. Why it works (the mechanism)

Walk me through RRF: why does it not need score normalization?

3. Advanced — application & what's next

I have a domain with lots of code identifiers. Dense misses them. Help me weight RRF in BM25's favor for those queries.
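
One possible direction for this prompt, as a minimal sketch: multiply each retriever's reciprocal-rank contribution by a weight, and raise the keyword weight when the query looks like it contains a code identifier. Weighted fusion is a common variation and not part of the original RRF formulation. The regex, the 2.0 boost, and the names weighted_rrf and sparse_weight are assumptions chosen for illustration; tune them on your own eval queries.

```python
import re
from typing import Dict, Hashable, List, Sequence, Tuple


def weighted_rrf(
    rankings: Sequence[Sequence[Hashable]],
    weights: Sequence[float],
    k: int = 60,
) -> List[Tuple[Hashable, float]]:
    """Like RRF, but ranking i contributes weights[i] / (k + rank) per id."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: Dict[Hashable, float] = {}
    for ranking, weight in zip(rankings, weights):
        # Ranks start at 1, as in standard RRF.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])


# Rough heuristic (an assumption): snake_case, camelCase, or codes like ERR-404.
IDENTIFIER_RE = re.compile(
    r"[A-Za-z]+_[A-Za-z0-9_]+"      # snake_case
    r"|[a-z]+[A-Z][A-Za-z0-9]*"     # camelCase
    r"|\b[A-Z]{2,}-?\d+\b"          # ERR-404, E1234
)


def sparse_weight(query: str, boosted: float = 2.0) -> float:
    """Weight for the keyword ranking: boosted if the query contains an identifier."""
    return boosted if IDENTIFIER_RE.search(query) else 1.0
```

A call would then look like weighted_rrf([dense_ids, sparse_ids], [1.0, sparse_weight(query)]). With equal weights this reduces to plain RRF, so the boost only changes results for queries the heuristic flags.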

import asyncio

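def rrf(rankings, k=60):
    """rankings: list[list[id]], best first. Returns (id, score) pairs, highest score first."""
    scores = {}
    for ranking in rankings:
        # Ranks start at 1, matching the RRF paper and ROW_NUMBER() in the SQL version below.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])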

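# embed_batch and db are not defined in this file; they are assumed to come from
# your existing retrieval code. Both are assumed to be blocking calls that are safe
# to run in a worker thread. Without the thread hand-off, asyncio.gather below
# would run the two searches one after the other.
async def dense_search(query, k=50):
    def run():
        qv = embed_batch([query])[0]
        return db.query("""
            SELECT id FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (qv, k))
    return await asyncio.to_thread(run)

async def bm25_search(query, k=50):
    # ts_rank_cd is Postgres's cover-density ranking, used here as a BM25 stand-in.
    def run():
        return db.query("""
            SELECT id FROM chunks, plainto_tsquery('english', %s) q
            WHERE tsv @@ q
            ORDER BY ts_rank_cd(tsv, q) DESC
            LIMIT %s
        """, (query, k))
    return await asyncio.to_thread(run)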

async def hybrid_search(query, k=50):
    dense, sparse = await asyncio.gather(
        dense_search(query, k=k),
        bm25_search(query, k=k),
    )
    dense_ids = [r["id"] for r in dense]
    sparse_ids = [r["id"] for r in sparse]
    fused = rrf([dense_ids, sparse_ids])
    return [id for id, _ in fused[:k]]

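# Or single-query SQL hybrid (faster on a single Postgres).
# Uses $1/$2 positional placeholders (asyncpg style); adapt them if your driver
# expects %s, as in the functions above.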
HYBRID_SQL = """
WITH dense AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
  FROM chunks ORDER BY embedding <=> $1::vector LIMIT 50
),
sparse AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, q) DESC) AS rank
  FROM chunks, plainto_tsquery('english', $2) q
  WHERE tsv @@ q ORDER BY ts_rank_cd(tsv, q) DESC LIMIT 50
)
SELECT id, SUM(1.0 / (60 + rank)) AS score
FROM (SELECT id, rank FROM dense UNION ALL SELECT id, rank FROM sparse) u
GROUP BY id ORDER BY score DESC LIMIT 50
"""

Run: python3 main.py

BM25 + hybrid search with RRF