Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
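Dense retrieval alone misses exact keyword matches (product names, error codes, identifiers). BM25 alone misses paraphrases. Hybrid search covers both: run the two retrievers in parallel, then merge their rankings with Reciprocal Rank Fusion (RRF). For most corpora, hybrid lifts retrieval@5 by 5-15% over dense alone. It costs almost nothing, because keyword search over a Postgres tsvector column is fast. Note that Postgres ranks matches with ts_rank / ts_rank_cd rather than true BM25; this lesson uses it as the BM25 stand-in.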
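To make the fusion step concrete, here is a minimal sketch of the RRF arithmetic on two toy rankings. The document ids and the helper name `rrf_scores` are made up for illustration; `k=60` is the constant this lesson uses. Only rank positions enter the sum, so the dense distances and keyword scores never need to share a scale.

```python
def rrf_scores(rankings, k=60):
    """Sum 1 / (k + rank) for every list a document appears in (ranks start at 1)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical results: "c" is only third in each list, but it is in both.
dense = ["a", "b", "c"]
sparse = ["d", "e", "c"]

scores = rrf_scores([dense, sparse])
ranked = sorted(scores, key=scores.get, reverse=True)

# "c" scores 1/63 + 1/63, which beats the 1/61 earned by "a" or "d" alone.
print(ranked[0])  # c
```

A document that both retrievers agree on, even at a modest rank, outranks one that only a single retriever puts first. That agreement signal is what hybrid search is buying.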
Implementation: pgvector handles dense retrieval, tsvector + GIN handles the BM25 side, and RRF fuses the two in one CTE. Cap each retriever at 50 candidates, fuse to the top 50 unique ids, and pass those to the reranker (next task). The whole thing is either one SQL query in Postgres, or two parallel queries in your app followed by RRF in Python.
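The queries in this lesson rely on a `tsv` column and an `embedding` column on `chunks`. A rough sketch of the supporting schema might look like the following. It assumes the raw text lives in a column named `content` and that `embedding` already exists as a pgvector `vector` column; the column name `content`, the index names, and the helper `schema_statements` are hypothetical. The cosine operator class is chosen because the queries order by `<=>`.

```python
# Sketch only: `content` and the index names are assumptions, not from the lesson.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE chunks
    ADD COLUMN IF NOT EXISTS tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX IF NOT EXISTS chunks_tsv_idx
    ON chunks USING GIN (tsv);

CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)
"""


def schema_statements(sql=SCHEMA_SQL):
    """Split the script into single statements for drivers that run one at a time."""
    return [stmt.strip() for stmt in sql.split(";") if stmt.strip()]


for stmt in schema_statements():
    print(stmt.splitlines()[0])
```

The generated column keeps `tsv` in sync with the text automatically, so the keyword query never has to call `to_tsvector` at search time.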
Added latency: <50ms vs dense-only.
Use these three in order. Each builds on the one before.
Why does hybrid search beat either retriever alone? Give a query type for each.
Walk me through RRF: why does it not need score normalization?
I have a domain with lots of code identifiers. Dense misses them. Help me weight RRF in BM25's favor for those queries.
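import asyncio


def rrf(rankings, k=60):
    """rankings: list[list[id]]. Returns sorted list of (id, score)."""
    scores = {}
    for ranking in rankings:
        # Ranks start at 1 so the scores match the ROW_NUMBER()-based SQL version.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])


# embed_batch and db are defined elsewhere (not shown here). They are assumed
# to be blocking calls, so each one runs in a worker thread; otherwise
# asyncio.gather would execute the two searches one after the other.
async def dense_search(query, k=50):
    qv = (await asyncio.to_thread(embed_batch, [query]))[0]
    return await asyncio.to_thread(db.query, """
        SELECT id FROM chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (qv, k))


async def bm25_search(query, k=50):
    return await asyncio.to_thread(db.query, """
        SELECT id FROM chunks, plainto_tsquery('english', %s) q
        WHERE tsv @@ q
        ORDER BY ts_rank_cd(tsv, q) DESC
        LIMIT %s
    """, (query, k))


async def hybrid_search(query, k=50):
    # Run both retrievers concurrently, then fuse their rankings.
    dense, sparse = await asyncio.gather(
        dense_search(query, k=k),
        bm25_search(query, k=k),
    )
    dense_ids = [r["id"] for r in dense]
    sparse_ids = [r["id"] for r in sparse]
    fused = rrf([dense_ids, sparse_ids])
    return [doc_id for doc_id, _ in fused[:k]]
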
# Or single-query SQL hybrid (faster on a single Postgres)
HYBRID_SQL = """
WITH dense AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank
    FROM chunks
    ORDER BY embedding <=> $1::vector
    LIMIT 50
),
sparse AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, q) DESC) AS rank
    FROM chunks, plainto_tsquery('english', $2) q
    WHERE tsv @@ q
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT 50
)
SELECT id, SUM(1.0 / (60 + rank)) AS score
FROM (SELECT id, rank FROM dense UNION ALL SELECT id, rank FROM sparse) u
GROUP BY id
ORDER BY score DESC
LIMIT 50
"""

python3 main.py
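
For identifier-heavy domains, one possible extension is a weighted RRF that multiplies each retriever's contribution by a per-list weight. The sketch below is an illustration, not part of the lesson's code: `weighted_rrf`, `pick_weights`, the regex heuristic, and the 2.0 weight are all hypothetical starting points that would need tuning against your own eval set.

```python
import re

# Hypothetical heuristic: treat a query as "identifier-like" if it contains
# snake_case, camelCase, or an error-code-shaped token.
IDENTIFIER_PATTERN = re.compile(
    r"\w+_\w+"            # snake_case, e.g. user_id
    r"|[a-z]+[A-Z]\w*"    # camelCase, e.g. getUserById
    r"|\b[A-Z]+-?\d+\b"   # error codes, e.g. E1234 or ORA-00942
)


def pick_weights(query):
    """Return (dense_weight, sparse_weight); favor keyword search for identifiers."""
    if IDENTIFIER_PATTERN.search(query):
        return (1.0, 2.0)  # arbitrary boost, tune on real queries
    return (1.0, 1.0)


def weighted_rrf(rankings, weights, k=60):
    """Like plain RRF, but each list contributes weight / (k + rank)."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])


# The two retrievers disagree on the order of "x" and "y".
dense_ids = ["x", "y"]
sparse_ids = ["y", "x"]

weights = pick_weights("getUserById returns null")
fused = weighted_rrf([dense_ids, sparse_ids], weights)
print(fused[0][0])  # y
```

With equal weights the two documents tie; doubling the keyword weight lets the keyword retriever's top hit win. Because the weights scale rank-based terms rather than raw scores, the two retrievers still do not need normalized scores.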