Capstok — learn by doing

Why this matters

Three similarity metrics are in common use for embeddings, and they are not interchangeable. Cosine measures the angle between two vectors and ignores magnitude; most text embedding models are trained for it. Dot product is cosine multiplied by both vectors' magnitudes, so it is useful when embeddings are NOT normalized and a 'longer' vector means 'more confident'. Euclidean (L2) is straight-line distance; it is rarely used for text but standard for some image models. The default for text embeddings is cosine, but knowing the difference saves you a confusing afternoon when a query returns nonsense because you used the wrong operator.
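A small worked example makes the difference concrete. The sketch below uses made-up 2-D vectors (not real embeddings) chosen so that each metric picks a different nearest document; the names `docs`, `query`, and `best_by_metric` are illustrative only.

```python
import numpy as np

# Toy, un-normalized vectors chosen by hand for illustration.
docs = {
    "A": np.array([2.0, 2.0]),   # same direction as the query
    "B": np.array([1.0, 5.0]),   # long vector, different direction
    "C": np.array([1.2, 0.9]),   # physically closest to the query
}
query = np.array([1.0, 1.0])

def best_by_metric(query, docs):
    """Return the top document under each of the three metrics."""
    q_len = np.linalg.norm(query)
    cos = {k: float(v @ query / (np.linalg.norm(v) * q_len)) for k, v in docs.items()}
    dot = {k: float(v @ query) for k, v in docs.items()}
    l2 = {k: float(np.linalg.norm(v - query)) for k, v in docs.items()}
    return {
        "cosine": max(cos, key=cos.get),  # higher similarity is better
        "dot": max(dot, key=dot.get),     # higher similarity is better
        "l2": min(l2, key=l2.get),        # lower distance is better
    }

winners = best_by_metric(query, docs)
print(winners)
```

Cosine rewards direction, dot product rewards direction and length together, and L2 rewards physical closeness, which is why the three winners differ here.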

Demo

Most providers (OpenAI, Voyage, Cohere) return normalized embeddings, meaning vectors of unit length, though this can vary by model, so confirm it in the provider's docs. For normalized embeddings, cosine and dot product give identical scores and L2 gives the same ranking, so use dot: it skips the norm computation. For un-normalized embeddings (e.g. some local models), cosine and dot give different rankings, and you must either normalize or pick one explicitly. In pgvector the operators are <=> (cosine distance), <#> (negative inner product), and <-> (L2 distance). In Qdrant and Weaviate, you set the metric at collection creation and can't easily change it later.
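The equivalence on unit-length vectors can be checked numerically. The sketch below uses random vectors as stand-ins for real embeddings (an assumption for illustration) and verifies two identities: dot equals cosine, and squared L2 distance equals 2 - 2 * dot, which is why all three metrics rank normalized vectors the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 8))   # stand-in for 50 document embeddings
query = rng.normal(size=8)        # stand-in for a query embedding

def unit(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

docs_n, query_n = unit(docs), unit(query)

dot_scores = docs_n @ query_n
cos_scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2_dists = np.linalg.norm(docs_n - query_n, axis=1)

# Identity 1: on unit vectors, dot product is the cosine similarity.
same_scores = np.allclose(dot_scores, cos_scores)
# Identity 2: ||a - b||^2 = 2 - 2 * (a . b) when |a| = |b| = 1.
l2_identity = np.allclose(l2_dists ** 2, 2 - 2 * dot_scores)
# Consequence: highest dot first matches lowest L2 first.
same_ranking = np.array_equal(np.argsort(-dot_scores), np.argsort(l2_dists))

print(same_scores, l2_identity, same_ranking)
```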

Try it yourself

Check your embedding model's docs: are vectors normalized? Which metric does it recommend? If unspecified, assume cosine.
If using pgvector, verify your index's op class matches the metric your queries use (vector_cosine_ops, vector_ip_ops, vector_l2_ops).
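Compare cosine vs L2 rankings on a few queries. On normalized vectors they produce the same order; on un-normalized vectors they'll usually agree on top results but disagree on the long tail. Pick one and stick with it.
Profile: dot product is cheaper than cosine on the same hardware because it skips the norm computations (around 30% faster is a rough figure, so measure on your own setup). At scale, normalize once at index time and use dot.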
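One way to keep the index operator class and the query operator from drifting apart is to derive both from a single metric name. The sketch below only builds SQL strings and does not connect to a database. The table `items`, the columns `id` and `embedding`, the helper names, and the `%s` driver placeholder are all hypothetical, and HNSW is assumed as the index type (pgvector offers others).

```python
# Metric name -> (pgvector distance operator, matching index operator class).
PGVECTOR_METRICS = {
    "cosine": ("<=>", "vector_cosine_ops"),
    "dot": ("<#>", "vector_ip_ops"),
    "l2": ("<->", "vector_l2_ops"),
}

def index_sql(metric, table="items", column="embedding"):
    """DDL for an index whose operator class matches the metric."""
    _, opclass = PGVECTOR_METRICS[metric]
    # Identifiers are interpolated directly: pass trusted constants only.
    return f"CREATE INDEX ON {table} USING hnsw ({column} {opclass});"

def query_sql(metric, table="items", column="embedding", k=5):
    """Nearest-neighbour query using the operator for the same metric."""
    op, _ = PGVECTOR_METRICS[metric]
    # All three operators return a distance, so ascending order is correct.
    return f"SELECT id FROM {table} ORDER BY {column} {op} %s::vector LIMIT {int(k)};"

print(index_sql("cosine"))
print(query_sql("cosine"))
```

Because both statements read from the same table entry, changing the metric in one place changes the index and the query together.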

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Explain cosine, dot product, and Euclidean distance for embeddings. When are cosine and dot equivalent?

2. Why it works (the mechanism)

Walk me through why most embedding models prefer cosine: what does the contrastive training objective optimize, and how does it interact with the geometry of the embedding space?

3. Advanced — application & what's next

I have un-normalized embeddings from a local model where magnitude carries information (longer vector = higher confidence). Should I use dot product or cosine? What changes about my retrieval if I switch?


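import numpy as np

# Similarity / distance functions for two 1-D vectors.
def cosine(a, b):
    # Dividing by both norms removes magnitude, leaving only the angle.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot(a, b):
    # Equals cosine only when both vectors have unit length.
    return np.dot(a, b)

def euclidean(a, b):
    return np.linalg.norm(a - b)

# normalize once at index time -> cheaper queries
def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Toy data so the script runs; replace with your model's embeddings.
doc_vecs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
query = np.array([1.0, 1.0])

doc_vecs_norm = np.array([normalize(v) for v in doc_vecs])
q_norm = normalize(query)

# now: one matrix-vector product gives the cosine score for every document.
scores = doc_vecs_norm @ q_norm
assert np.allclose(scores, [cosine(query, d) for d in doc_vecs])
print(scores)

# pgvector operators (each returns a distance, so sort ascending):
# <=>   cosine distance         (= 1 - cosine_similarity)
# <#>   negative inner product  (multiply by -1 to recover the dot product)
# <->   L2 (Euclidean) distance
#
# Index: vector_cosine_ops, vector_ip_ops, vector_l2_ops respectively.

# DON'T mix: a pgvector index only serves queries that use its own operator.
# An index built with vector_cosine_ops is skipped for a <-> query, which then
# falls back to a slow sequential scan.

# Most embedding model docs tell you which to use; default to cosine if unsure.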

Run: python3 main.py

Cosine vs dot product vs Euclidean