Capstok — learn by doing

Why this matters

Real corpora are messy: Markdown with headings, HTML pages with navigation cruft, PDFs with tables. A one-size-fits-all chunker ignores the structure each format carries, so it produces broken chunks for a large share of a mixed corpus. The right structure is a dispatcher: detect the doc type, route to the right parser and chunker, attach standard metadata. Once you have the dispatcher, adding a doc type is just another plugin.
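The plugin idea can be sketched with a plain dict registry keyed by doc type. This is a minimal illustration, not the lesson's implementation: the names HANDLERS, register, and dispatch are hypothetical, and the two handlers are simple stand-ins for real parsers and chunkers.

```python
from typing import Callable

# Hypothetical registry: doc type -> handler that returns a list of chunk texts.
HANDLERS: dict[str, Callable[[str], list[str]]] = {}


def register(doc_type: str):
    """Decorator that plugs a handler into the registry."""
    def wrap(fn):
        HANDLERS[doc_type] = fn
        return fn
    return wrap


@register("markdown")
def split_markdown(text: str) -> list[str]:
    # Stand-in: start a new chunk at every heading line.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


@register("text")
def split_text(text: str) -> list[str]:
    # Stand-in: split on blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def detect_type(path: str) -> str:
    return "markdown" if path.lower().endswith(".md") else "text"


def dispatch(path: str, text: str) -> list[str]:
    # Unknown types fall back to the plain-text handler.
    handler = HANDLERS.get(detect_type(path), HANDLERS["text"])
    return handler(text)


print(dispatch("guide.md", "# A\nx\n## B\ny"))  # ['# A\nx', '## B\ny']
```

Supporting a new format then means writing one function and decorating it; dispatch itself does not change.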

Demo

Build the ingest pipeline as ingest(doc) → Chunk[]. Each chunk gets: id, doc_id, chunk_index, text, source URL, section path (for markdown headings), source_hash (for change detection), tenant_id (if multi-tenant). For markdown: use a heading-aware splitter (LangChain's MarkdownHeaderTextSplitter). For HTML: strip nav and footer elements with BeautifulSoup before extracting text. For PDFs: extract text with pdfplumber, then run a recursive splitter. Standardize the output schema; downstream code should never need to know which parser was used.
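One way source_hash can drive change detection, sketched under two assumptions: the hash is a SHA-256 of the chunk text, and the previous run's hashes are kept in a dict keyed by chunk id. The helper chunks_to_reembed and the dict-shaped chunks are hypothetical.

```python
import hashlib


def text_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_reembed(new_chunks: list[dict], stored_hashes: dict[str, str]) -> list[dict]:
    """Return chunks whose id is new or whose hash differs from the stored one."""
    return [c for c in new_chunks if stored_hashes.get(c["id"]) != c["source_hash"]]


# Hashes saved from the previous ingest run (assumed storage format).
old = {
    "doc1:chunk_0": text_hash("Intro"),
    "doc1:chunk_1": text_hash("Setup v1"),
}

# Chunks produced by the current run.
new = [
    {"id": "doc1:chunk_0", "text": "Intro", "source_hash": text_hash("Intro")},
    {"id": "doc1:chunk_1", "text": "Setup v2", "source_hash": text_hash("Setup v2")},
    {"id": "doc1:chunk_2", "text": "FAQ", "source_hash": text_hash("FAQ")},
]

changed = chunks_to_reembed(new, old)
print([c["id"] for c in changed])  # ['doc1:chunk_1', 'doc1:chunk_2']
```

Because chunk ids are positional, inserting text near the top of a document shifts every later id and marks those chunks as changed. That is a safe default, though it may re-embed more than strictly necessary.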

Try it yourself

Build the dispatcher for 2-3 doc types in your dataset (start with markdown + HTML if applicable).
Inspect 20 random chunks. Are they coherent units (paragraphs, sections)? If many cut mid-sentence, fix the splitter for that type.
Save chunks to a JSONL file before embedding. Re-embed on demand without re-parsing.
Capture metadata generously. section_path turns into clickable citations later.
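The save-then-inspect steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the trimmed Chunk class, the helper names, and the "ends with sentence punctuation" check are all hypothetical, and the check is a crude heuristic rather than a real coherence test.

```python
import json
import random
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Chunk:
    # Trimmed-down stand-in for the full chunk schema.
    id: str
    text: str
    section_path: list[str] = field(default_factory=list)


def save_jsonl(chunks: list[Chunk], path: Path) -> None:
    """Write one JSON object per line so chunks can be reloaded without re-parsing."""
    with path.open("w", encoding="utf-8") as f:
        for c in chunks:
            f.write(json.dumps(asdict(c), ensure_ascii=False) + "\n")


def load_jsonl(path: Path) -> list[Chunk]:
    with path.open(encoding="utf-8") as f:
        return [Chunk(**json.loads(line)) for line in f if line.strip()]


def looks_cut_off(text: str) -> bool:
    """Crude heuristic: the chunk does not end with sentence-final punctuation."""
    return not text.rstrip().endswith((".", "!", "?"))


def sample_for_review(chunks: list[Chunk], k: int = 20, seed: int = 0) -> list[Chunk]:
    """Pick up to k random chunks; a fixed seed keeps the sample reproducible."""
    return random.Random(seed).sample(chunks, min(k, len(chunks)))


chunks = [
    Chunk("d:chunk_0", "Install the package.", ["Setup"]),
    Chunk("d:chunk_1", "Then run the comma", ["Setup", "Run"]),
]
out = Path(tempfile.gettempdir()) / "chunks.jsonl"
save_jsonl(chunks, out)
reloaded = load_jsonl(out)

flagged = [c.id for c in sample_for_review(reloaded) if looks_cut_off(c.text)]
print(flagged)  # ['d:chunk_1']
```

The heuristic will also flag chunks that legitimately end in a code block or a list item, so treat the flagged ids as a reading list for manual review, not as a verdict.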

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

Why does a one-size chunker fail on mixed-format corpora? Give 2 examples.

2. Why it works (the mechanism)

Walk me through MarkdownHeaderTextSplitter. Why does heading-aware splitting help retrieval?

3. Advanced — application & what's next

Design a chunker for HTML docs with tables and code blocks. What gets stripped, what gets preserved as separate chunks?

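import hashlib
from dataclasses import dataclass, field
from typing import Optional

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter


@dataclass
class Chunk:
    """Standard output schema; downstream code never sees which parser ran."""
    id: str
    doc_id: str
    chunk_index: int
    text: str
    source_url: str
    section_path: list[str] = field(default_factory=list)  # heading trail (markdown only)
    source_hash: str = ""  # SHA-256 of the chunk text, for change detection
    tenant_id: Optional[str] = None  # set only in multi-tenant deployments


def detect_type(path):
    """Route on file extension (case-insensitive); unknown types fall back to text."""
    p = path.lower()
    if p.endswith(".md"): return "markdown"
    if p.endswith(".html"): return "html"
    if p.endswith(".pdf"): return "pdf"
    return "text"


def parse_text(path, content) -> str:
    """Turn raw file content (bytes or str) into plain text for the splitter."""
    t = detect_type(path)
    if t == "html":
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(content, "lxml")
        # Drop navigation cruft and non-content tags before extracting text.
        for tag in soup(["nav", "footer", "script", "style"]):
            tag.decompose()
        return soup.get_text("\n", strip=True)
    if t == "pdf":
        import io
        import pdfplumber
        # pdfplumber reads from a file-like object, so content must be bytes here.
        with pdfplumber.open(io.BytesIO(content)) as pdf:
            return "\n\n".join(p.extract_text() or "" for p in pdf.pages)
    return content.decode() if isinstance(content, bytes) else content


def chunker_for(doc_type):
    """Pick the splitter for a doc type."""
    if doc_type == "markdown":
        # Splits at headings only; long sections are not size-capped here.
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
        )
    return RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)


def ingest(path, content, base_metadata) -> list[Chunk]:
    """Parse and chunk one document. base_metadata needs 'doc_id' and 'url';
    'tenant_id' is optional."""
    doc_type = detect_type(path)
    text = parse_text(path, content)
    splitter = chunker_for(doc_type)
    raw = splitter.split_text(text)

    chunks = []
    for i, c in enumerate(raw):
        # The markdown splitter returns Document objects; the recursive one returns str.
        chunk_text = c.page_content if hasattr(c, "page_content") else c
        metadata = c.metadata if hasattr(c, "metadata") else {}
        # Keys h1 < h2 < h3 sort into heading order, outermost first.
        section_path = [v for k, v in sorted(metadata.items()) if k.startswith("h")]
        chunks.append(Chunk(
            id=f"{base_metadata['doc_id']}:chunk_{i}",
            doc_id=base_metadata["doc_id"],
            chunk_index=i,
            text=chunk_text,
            source_url=base_metadata["url"],
            section_path=section_path,
            source_hash=hashlib.sha256(chunk_text.encode()).hexdigest(),
            tenant_id=base_metadata.get("tenant_id"),
        ))
    return chunks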

Run: python3 main.py

Document ingestion + chunking dispatcher