Embeddings + pgvector index

medium

Learn with your AI

Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.

Open in Claude Open in ChatGPT

Why this matters

Embed every chunk and store in pgvector. The choice of embedding model (text-embedding-3-small for English; voyage-3 for stronger quality; cohere-multilingual for non-English) determines your retrieval ceiling. pgvector + HNSW handles up to 1M chunks well; beyond that, Module 7 of the RAG course covers specialized vector DBs.

Demo

Setup: enable pgvector extension, define a chunks table with embedding vector(1536) column + HNSW index. Embed chunks in batches (up to 2048 per OpenAI request). Upsert by chunk id so re-runs are idempotent. Track which embedding model + version produced each row in metadata — when you upgrade the model, you'll need to know what's old.

Try it yourself

Stand up pgvector locally (Supabase / Neon / Docker). Run the SQL.
Index all your chunks. Time it. ~10K chunks should take 1-2 minutes with batched embeddings.
Test: run a sample query against chunks ORDER BY embedding <=> ... — verify HNSW is being used (EXPLAIN ANALYZE).
If you re-run ingest, ON CONFLICT DO UPDATE makes it idempotent. Verify by re-running and checking row count.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

What does pgvector add to Postgres? Name 3 reasons to start there vs Pinecone.

2. Why it works (the mechanism)

Walk me through HNSW: what does m, ef_construction, ef_search control?

3. Advanced — application & what's next

I have 10M chunks. Should I stay on pgvector? Walk through the decision.

References

Chat about this lesson

# SQL CREATE EXTENSION IF NOT EXISTS vector; CREATE TABLE chunks ( id text PRIMARY KEY, doc_id text NOT NULL, chunk_index int NOT NULL, text text NOT NULL, embedding vector(1536), tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED, metadata jsonb, source_hash text NOT NULL, embed_model text NOT NULL, created_at timestamptz NOT NULL DEFAULT now() ); CREATE INDEX chunks_emb_idx ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64); CREATE INDEX chunks_tsv_idx ON chunks USING gin(tsv); CREATE INDEX chunks_doc_idx ON chunks(doc_id); # Python ingest import os, asyncio, psycopg from pgvector.psycopg import register_vector from openai import OpenAI oai = OpenAI() MODEL = "text-embedding-3-small" def embed_batch(texts): r = oai.embeddings.create(model=MODEL, input=texts) return [d.embedding for d in r.data] async def index_chunks(chunks, batch_size=256): with psycopg.connect(os.environ["DATABASE_URL"]) as conn: register_vector(conn) for i in range(0, len(chunks), batch_size): batch = chunks[i:i+batch_size] vecs = embed_batch([c.text for c in batch]) with conn.cursor() as cur: cur.executemany(""" INSERT INTO chunks (id, doc_id, chunk_index, text, embedding, metadata, source_hash, embed_model) VALUES (%s, %s, %s, %s, %s, %s, %s, %s) ON CONFLICT (id) DO UPDATE SET text = EXCLUDED.text, embedding = EXCLUDED.embedding, source_hash = EXCLUDED.source_hash """, [(c.id, c.doc_id, c.chunk_index, c.text, v, {"section_path": c.section_path, "source_url": c.source_url}, c.source_hash, MODEL) for c, v in zip(batch, vecs)]) conn.commit() print(f"Indexed {i + len(batch)} / {len(chunks)}")