Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
An embedding is a vector representation of text that places semantically similar text close together in N-dimensional space. The choice of embedding model sets your retrieval ceiling: with a bad embedding model, 'how do I get a refund?' lands miles from 'cancel my order'. Sensible defaults for 2026: OpenAI text-embedding-3-small ($0.02/1M tokens, 1536 dims) for general English; Voyage-3 or BGE-large for stronger English retrieval; Cohere embed-multilingual-v3 for non-English or cross-lingual corpora. Run a small set of comparison queries against your own data before committing: a 20-minute experiment now can save a months-long migration later.
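To make "close together" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The values in `toy_vectors` are hypothetical and were not produced by any real model; real embeddings have hundreds or thousands of dimensions. The function name `cosine_similarity` is likewise just an illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors; a real embedding model would produce these.
toy_vectors = {
    "how do I get a refund?": [0.9, 0.1, 0.2],
    "cancel my order": [0.8, 0.2, 0.3],
    "postgres MVCC internals": [0.1, 0.9, 0.1],
}

query = "how do I get a refund?"
# Rank the other texts by similarity to the query, most similar first.
ranked = sorted(
    (text for text in toy_vectors if text != query),
    key=lambda text: cosine_similarity(toy_vectors[query], toy_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(toy_vectors[query], toy_vectors[text]):.3f}  {text}")
```

With a good model, the order-related text should rank above the unrelated database text, as it does with these toy values.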
Trade-offs: larger embeddings (1536-3072 dims) capture more nuance at higher storage and retrieval cost. Smaller embeddings (256-768 dims) are cheaper and faster, and often 'good enough' for narrow domains. A newer option: Matryoshka embeddings (Voyage, OpenAI v3) let you truncate vectors to fit your size budget without re-training, so you can store 256 dims and keep roughly 90% of 1536-dim quality; treat that figure as a rule of thumb and check it on your own data. The unsexy truth: embedding model differences matter less than chunking and reranking. Pick a reasonable default and move on.
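The storage side of this trade-off is simple arithmetic, and truncation is a slice plus a re-normalization. The sketch below assumes float32 storage and counts raw vectors only (no index overhead). The helper names `truncate_embedding` and `index_size_mb` are illustrative, and the random vector is only a stand-in for a real embedding. Truncation preserves quality only for models trained Matryoshka-style; with OpenAI's text-embedding-3 models you can also request a shorter vector directly through the API's `dimensions` parameter.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

def index_size_mb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone, in megabytes (float32 by default)."""
    return num_vectors * dims * bytes_per_value / 1e6

rng = np.random.default_rng(0)
full = rng.normal(size=1536)           # stand-in for a real 1536-dim embedding
small = truncate_embedding(full, 256)  # 256-dim unit vector

print(small.shape)
print(index_size_mb(1_000_000, 1536))  # 1M vectors at 1536 dims
print(index_size_mb(1_000_000, 256))   # 1M vectors at 256 dims
```

For one million vectors this works out to about 6.1 GB at 1536 dims versus about 1 GB at 256 dims, a 6x difference before any index structures are added.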
Use these three in order. Each builds on the one before.
Explain embeddings: what space they live in, why similar texts land close, and what cosine similarity measures geometrically.
Walk me through how an embedding model is trained — the contrastive objective, positive/negative pairs, why anchor-positive-negative triplets work better than absolute labels.
I have a multi-language corpus (English + Hindi + Spanish), users query in any language, and need fast lookup. Which embedding model + index design? What are the trade-offs vs running a per-language pipeline?
# Comparing embedding models on YOUR data: the only test that matters
import numpy as np
from openai import OpenAI
import voyageai

oa = OpenAI()           # reads OPENAI_API_KEY from the environment
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

EVAL_PAIRS = [
    ("How do I cancel my subscription?", "Subscriptions can be cancelled in account settings under Plan."),
    ("Why is checkout slow?", "Checkout latency is dominated by the Stripe API round-trip (~800ms)."),
    ("What's our refund policy?", "Refunds are issued within 14 days of purchase for unused services."),
    # ... 20-50 query/answer pairs from your own corpus
]

# negatives (queries paired with WRONG docs) help measure discrimination
NEGATIVES = [
    ("How do I cancel my subscription?", "Postgres MVCC keeps multiple row versions visible..."),
    # ...
]

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score_model(name, embed_fn):
    # embed_fn([q, d]) returns two vectors, unpacked into cos(a, b)
    pos = [cos(*embed_fn([q, d])) for q, d in EVAL_PAIRS]
    neg = [cos(*embed_fn([q, d])) for q, d in NEGATIVES]
    margin = np.mean(pos) - np.mean(neg)
    return f"{name}: pos={np.mean(pos):.3f} neg={np.mean(neg):.3f} margin={margin:.3f}"

def oa_embed(texts):
    r = oa.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in r.data]

def voyage_embed(texts):
    r = vo.embed(texts, model="voyage-3")
    return r.embeddings

print(score_model("openai-3-small", oa_embed))
print(score_model("voyage-3", voyage_embed))
# higher margin = better discrimination on YOUR data, which is the only metric that matters.
# Run with: python3 main.py
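The margin above compares averages, so a few very high positives can hide individual queries that retrieve the wrong document. A complementary check is to rank every document for each query and count how often the correct one lands in the top k. The sketch below assumes you already have query and document vectors as NumPy arrays, with row i of each forming a matching pair. The name `recall_at_k` and the toy arrays are illustrative and not part of the demo.

```python
import numpy as np

def recall_at_k(query_vecs, doc_vecs, k=1):
    """Fraction of queries whose matching doc (same row index) ranks in the top k."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                            # (num_queries, num_docs) cosine matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar docs
    correct = np.arange(len(q))[:, None]      # row i should retrieve doc i
    return float(np.mean(np.any(top_k == correct, axis=1)))

# Toy 2-dim vectors standing in for real embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]])  # third query misses at k=1

print(f"recall@1 = {recall_at_k(queries, docs, k=1):.2f}")
print(f"recall@2 = {recall_at_k(queries, docs, k=2):.2f}")
```

In this toy data the third query sits closer to the first document than to its own, so it counts as a miss at k=1 and a hit at k=2. A mean-based margin can smooth over exactly this kind of per-query failure.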