Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Three distance metrics are in common use for embeddings, and they are not interchangeable. Cosine measures the angle between two vectors and ignores magnitude; most text embedding models are trained for it. Dot product is cosine scaled by the magnitudes of both vectors, so it is useful when embeddings are NOT normalized and a 'longer' vector means 'more confident'. Euclidean (L2) is straight-line geometric distance; it is rarely used for text but standard for some image models. The default for text embeddings is cosine, but knowing the difference saves you a confusing afternoon when a query returns nonsense because you used the wrong operator.
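To make the difference concrete, here is a minimal sketch in which each metric picks a different best match for the same query. The 2-D vectors and their names are hypothetical, chosen only so that the three metrics disagree; they are not real embeddings.

```python
import numpy as np

# Hypothetical 2-D vectors, chosen only to make the three metrics disagree.
query = np.array([1.0, 0.0])
docs = {
    "same_direction_short": np.array([0.5, 0.0]),
    "near_direction_long": np.array([3.0, 1.0]),
    "close_in_space": np.array([1.0, 0.4]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    return float(np.dot(a, b))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

# Higher is better for cosine and dot product; lower is better for L2.
best_by_cosine = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
best_by_dot = max(docs, key=lambda name: dot_product(query, docs[name]))
best_by_l2 = min(docs, key=lambda name: euclidean_distance(query, docs[name]))

print(best_by_cosine)  # same_direction_short
print(best_by_dot)     # near_direction_long
print(best_by_l2)      # close_in_space
```

Cosine rewards the vector pointing in exactly the query's direction, dot product rewards the longest vector that is roughly aligned, and L2 rewards the vector that sits closest in space.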
Most hosted providers (OpenAI, Voyage, Cohere) return normalized embeddings, that is, vectors of unit length, though it is worth confirming this in your model's docs. For normalized embeddings, cosine similarity and dot product give identical scores, and the dot product is cheaper to compute because it skips the division by the norms; Euclidean distance also produces the same ranking. For un-normalized embeddings (e.g. from some local models), cosine and dot product give different rankings, so you must either normalize or pick one explicitly. In pgvector the operators are <=> (cosine distance), <#> (negative dot product), and <-> (L2 distance). In Qdrant and Weaviate, you set the metric when you create the collection, and it cannot easily be changed later.
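The equivalence for unit-length vectors can be checked numerically. The sketch below uses random stand-in vectors (an assumption, not output from any real embedding model) and also checks the identity that squared L2 distance equals 2 - 2 * cosine on unit vectors, which is why Euclidean distance ranks the same way once everything is normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: random vectors, not output from any real embedding model.
docs = rng.normal(size=(5, 8))
query = rng.normal(size=8)

# Scale every vector to unit length.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

dot_scores = docs_unit @ query_unit
cosine_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2_distances = np.linalg.norm(docs_unit - query_unit, axis=1)

# On unit vectors the dot product equals the cosine similarity,
# and squared L2 distance equals 2 - 2 * cosine.
assert np.allclose(dot_scores, cosine_scores)
assert np.allclose(l2_distances ** 2, 2 - 2 * cosine_scores)

# So all three metrics produce the same ordering (best match first).
order_by_dot = np.argsort(-dot_scores)
order_by_l2 = np.argsort(l2_distances)
assert np.array_equal(order_by_dot, order_by_l2)
```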
Use these three prompts in order. Each builds on the one before.
Explain cosine, dot product, and Euclidean distance for embeddings. When are cosine and dot equivalent?
Walk me through why most embedding models prefer cosine: what does the contrastive training objective optimize, and how does it interact with the geometry of the embedding space?
I have un-normalized embeddings from a local model where magnitude carries information (longer vector = higher confidence). Should I use dot product or cosine? What changes about my retrieval if I switch?
import numpy as np

# Similarity and distance functions.
# For normalized embeddings, cosine == dot product.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot(a, b):
    return np.dot(a, b)

def euclidean(a, b):
    return np.linalg.norm(a - b)

# normalize once at index time -> cheaper queries
def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v  # leave zero vectors unchanged

# Placeholder document vectors; in practice these come from your embedding model.
doc_vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
doc_vecs_norm = np.array([normalize(v) for v in doc_vecs])
# now: dot(q_norm, d) == cosine(q, d) for every row d of doc_vecs_norm.
# pgvector operators:
# <=> cosine distance (= 1 - cosine_similarity)
# <#> negative inner product (smaller = more similar; multiply by -1 to recover the dot product)
# <-> L2 (Euclidean) distance
#
# Index: vector_cosine_ops, vector_ip_ops, vector_l2_ops respectively.
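# DON'T mix: the index operator class must match the query operator. An index built
# with vector_cosine_ops cannot serve a <-> (L2) query, so that query runs without the
# index, and on un-normalized vectors the two metrics rank results differently anyway.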
# Most embedding model docs tell you which to use; default to cosine if unsure.

To run the demo:
python3 main.py
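The normalize-once idea from the demo extends naturally to a whole document matrix: normalize the rows at index time, then score a query against every document with a single matrix-vector product. This is a minimal sketch under stated assumptions; `normalize_rows`, `top_k`, and the sample vectors are hypothetical names and placeholder data, not part of any library.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.where(norms > 0, norms, 1.0)

def top_k(query, doc_matrix_norm, k=2):
    """Return (indices of the k most similar rows, all scores), best match first."""
    q = query / np.linalg.norm(query)
    scores = doc_matrix_norm @ q  # one dot product per document == cosine similarity
    return np.argsort(-scores)[:k], scores

# Placeholder vectors, not real embeddings.
doc_matrix = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 5.0],
])
doc_matrix_norm = normalize_rows(doc_matrix)  # done once, at index time

indices, scores = top_k(np.array([4.0, 1.0, 0.0]), doc_matrix_norm, k=2)
print(indices)  # [0 2]
```

Because the rows are already unit length, the per-query cost is one normalization of the query plus one matrix-vector product, with no per-document norm computation.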