Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Embed every chunk and store in pgvector. The choice of embedding model (text-embedding-3-small for English; voyage-3 for stronger quality; cohere-multilingual for non-English) determines your retrieval ceiling. pgvector + HNSW handles up to 1M chunks well; beyond that, Module 7 of the RAG course covers specialized vector DBs.
Setup: enable pgvector extension, define a chunks table with embedding vector(1536) column + HNSW index. Embed chunks in batches (up to 2048 per OpenAI request). Upsert by chunk id so re-runs are idempotent. Track which embedding model + version produced each row in metadata — when you upgrade the model, you'll need to know what's old.
chunks ORDER BY embedding <=> ... — verify HNSW is being used (EXPLAIN ANALYZE).Use these three in order. Each builds on the one before.
What does pgvector add to Postgres? Name 3 reasons to start there vs Pinecone.
Walk me through HNSW: what does m, ef_construction, ef_search control?
I have 10M chunks. Should I stay on pgvector? Walk through the decision.
# SQL
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
id text PRIMARY KEY,
doc_id text NOT NULL,
chunk_index int NOT NULL,
text text NOT NULL,
embedding vector(1536),
tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
metadata jsonb,
source_hash text NOT NULL,
embed_model text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX chunks_emb_idx ON chunks USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64);
CREATE INDEX chunks_tsv_idx ON chunks USING gin(tsv);
CREATE INDEX chunks_doc_idx ON chunks(doc_id);
# Python ingest
import os, asyncio, psycopg
from pgvector.psycopg import register_vector
from openai import OpenAI
oai = OpenAI()
MODEL = "text-embedding-3-small"
def embed_batch(texts):
r = oai.embeddings.create(model=MODEL, input=texts)
return [d.embedding for d in r.data]
async def index_chunks(chunks, batch_size=256):
with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
register_vector(conn)
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i+batch_size]
vecs = embed_batch([c.text for c in batch])
with conn.cursor() as cur:
cur.executemany("""
INSERT INTO chunks (id, doc_id, chunk_index, text, embedding, metadata, source_hash, embed_model)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (id) DO UPDATE SET
text = EXCLUDED.text,
embedding = EXCLUDED.embedding,
source_hash = EXCLUDED.source_hash
""", [(c.id, c.doc_id, c.chunk_index, c.text, v, {"section_path": c.section_path, "source_url": c.source_url}, c.source_hash, MODEL)
for c, v in zip(batch, vecs)])
conn.commit()
print(f"Indexed {i + len(batch)} / {len(chunks)}")python3 main.py