Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Hybrid retrieval gives you top-50 candidates. The LLM only sees top-5, so the ordering of those 50 matters enormously. A reranker (Cohere Rerank-3, Voyage Rerank-2, or a self-hosted BGE Reranker) is a cross-encoder model: it reads the query and a document together and scores each (query, doc) pair directly. Costs ~$1/1K queries managed; usually lifts answer accuracy 10-25%.
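To make the bi-encoder versus cross-encoder difference concrete, here is a toy sketch. Nothing in it is a real model: `embed`, `bi_score`, and `cross_score` are hypothetical stand-ins invented for illustration. The bi-encoder stand-in encodes each text on its own (as a bag of words) and compares vectors; the cross-encoder stand-in sees both texts at once and can use word order, which the independent vectors have already thrown away.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    return text.lower().split()


def embed(text: str) -> Counter:
    # Bi-encoder stand-in: each text is encoded on its own, with no
    # knowledge of what it will later be compared against.
    return Counter(tokens(text))


def bi_score(query: str, doc: str) -> float:
    # Cosine similarity between the two independently built vectors.
    q, d = embed(query), embed(doc)
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def bigrams(ts: list[str]) -> set[tuple[str, str]]:
    return set(zip(ts, ts[1:]))


def cross_score(query: str, doc: str) -> float:
    # Cross-encoder stand-in: the scorer sees both texts together, so it can
    # use interactions (here: shared word order) that the vectors discard.
    return float(len(bigrams(tokens(query)) & bigrams(tokens(doc))))


query = "dog bites man"
docs = ["man bites dog", "the dog bites man in the park"]

best_bi = max(docs, key=lambda d: bi_score(query, d))
best_cross = max(docs, key=lambda d: cross_score(query, d))
print("bi-encoder stand-in picks:   ", best_bi)
print("cross-encoder stand-in picks:", best_cross)
```

The bag-of-words vectors cannot tell "man bites dog" from "dog bites man", so the first stand-in prefers the wrong document; the joint scorer does not. Real cross-encoders learn far richer interactions than bigram overlap, but the structural point is the same: they pay a per-pair cost in exchange for seeing the query and the document together.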
Workflow: hybrid → top-50 → fetch chunk text → rerank → top-5 → feed to LLM. Latency: ~80-150ms for reranking 50 docs. Cost: ~$1/1K queries managed; a self-hosted BGE Reranker (~$50-100/mo fixed) has no per-query fee, so it is effectively free at scale. For the project, Cohere/Voyage is the fastest path.
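The cost trade-off can be worked through with a few lines of arithmetic. This is a rough sketch using the lesson's figures (~$1 per 1K queries managed, and 50-100 per month read as a fixed self-hosting cost in dollars). The function names and the 30-day month are assumptions, and treating the self-hosted cost as flat ignores that heavier traffic may need more hardware.

```python
SECONDS_PER_MONTH = 86_400 * 30  # assumes a 30-day month


def managed_monthly_cost(qps: float, price_per_1k: float = 1.0) -> float:
    """Monthly bill for a managed reranker at a sustained query rate."""
    queries = qps * SECONDS_PER_MONTH
    return queries / 1_000 * price_per_1k


def break_even_queries(fixed_monthly_cost: float, price_per_1k: float = 1.0) -> float:
    """Monthly query volume at which the managed bill equals the fixed cost."""
    return fixed_monthly_cost / price_per_1k * 1_000


print(managed_monthly_cost(qps=1.0))                        # 2592.0
print(break_even_queries(50.0), break_even_queries(100.0))  # 50000.0 100000.0
```

Under these assumptions, one sustained query per second already costs about $2,592 a month on a managed API, and the break-even point sits at roughly 50,000 to 100,000 queries per month. Below that volume the managed option is cheaper and simpler; well above it, the fixed-cost option wins on price.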
Use these three in order. Each builds on the one before.
Why does reranking help so much? What does the cross-encoder do differently?
Walk me through cross-encoder vs bi-encoder retrieval.
I have 1000 QPS sustained. Should I use Cohere managed or self-host BGE Reranker? Compare cost and latency.
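import asyncio
import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# db and hybrid_search are assumed to be defined elsewhere in the project.


async def rerank(query, candidate_ids, k=5):
    # Fetch the chunk text for every candidate id.
    rows = db.query("SELECT id, text, metadata FROM chunks WHERE id = ANY(%s)", (candidate_ids,))
    chunks_by_id = {r["id"]: r for r in rows}
    # Keep only ids that came back, in retrieval order, so the reranker's
    # result indexes line up with this list.
    present_ids = [i for i in candidate_ids if i in chunks_by_id]
    if not present_ids:
        return []
    docs = [chunks_by_id[i]["text"] for i in present_ids]
    # co.rerank is a blocking HTTP call; run it in a worker thread so the
    # event loop stays free.
    r = await asyncio.to_thread(
        co.rerank,
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=k,
    )
    # Map back to chunk records, preserving rerank order.
    # item.index is a position in docs, not in candidate_ids.
    ordered_ids = [present_ids[item.index] for item in r.results]
    return [chunks_by_id[i] for i in ordered_ids]


async def search(query, k=5):
    candidates = await hybrid_search(query, k=50)
    return await rerank(query, candidates, k=k)

python3 main.py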
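One detail worth checking offline is the mapping from rerank results back to chunk ids. The reranker's result indexes refer to positions in the documents list that was sent, which can differ from positions in the candidate id list if any candidate had no row in the database. The sketch below uses a hypothetical stand-in, `fake_rerank`, that scores by shared words; it is not the Cohere API, and `FakeResult` with its `relevance_score` field is only an illustrative shape.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    index: int              # position in the documents list that was sent
    relevance_score: float


def fake_rerank(query: str, documents: list[str], top_n: int) -> list[FakeResult]:
    # Hypothetical stand-in for a reranker: scores by shared words.
    q = set(query.lower().split())
    scored = [
        FakeResult(i, float(len(q & set(d.lower().split()))))
        for i, d in enumerate(documents)
    ]
    scored.sort(key=lambda r: r.relevance_score, reverse=True)
    return scored[:top_n]


candidate_ids = ["a", "b", "c"]
chunks_by_id = {  # "a" is missing, e.g. deleted since it was indexed
    "b": {"id": "b", "text": "billing faq"},
    "c": {"id": "c", "text": "how to reset a password"},
}

present_ids = [i for i in candidate_ids if i in chunks_by_id]
docs = [chunks_by_id[i]["text"] for i in present_ids]
results = fake_rerank("reset password", docs, top_n=1)

right = [present_ids[r.index] for r in results]    # indexes the list that was sent
wrong = [candidate_ids[r.index] for r in results]  # off by one when "a" is missing
print(right, wrong)
```

Indexing into the filtered list returns chunk "c", the password document. Indexing into the original candidate list returns "b", silently handing the LLM the wrong chunk. The failure only shows up when a candidate is missing, so a small fixture like this one is a cheap way to catch it.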