Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Real corpora are messy. Markdown with headings. HTML pages with navigation cruft. PDFs with tables. A one-size-fits-all chunker produces broken chunks for a large share of a corpus like that. The right structure is a dispatcher: detect the doc type, route to the right parser and chunker, attach standard metadata. Once the dispatcher exists, adding a doc type is just another plugin.
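To make the plugin idea concrete, here is a minimal sketch of a dispatcher built on a registry. The names `register`, `dispatch`, and the handlers are hypothetical, and the handlers are deliberately naive placeholders rather than real parsers; the point is only that supporting a new doc type means registering one more function.

```python
import os
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> handler that turns text into chunks.
_HANDLERS: Dict[str, Callable[[str], List[str]]] = {}


def register(*extensions: str):
    """Decorator that registers a handler for one or more file extensions."""
    def decorator(fn):
        for ext in extensions:
            _HANDLERS[ext.lower()] = fn
        return fn
    return decorator


@register(".txt")
def handle_text(text: str) -> List[str]:
    # Fallback plugin: the whole document becomes a single chunk.
    return [text.strip()] if text.strip() else []


@register(".md", ".markdown")
def handle_markdown(text: str) -> List[str]:
    # Placeholder only: splits on blank lines. A real plugin would be heading-aware.
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def dispatch(path: str, text: str) -> List[str]:
    """Pick a handler by extension; unknown types fall back to plain text."""
    ext = os.path.splitext(path)[1].lower()
    return _HANDLERS.get(ext, handle_text)(text)
```

Callers only ever see `dispatch`, so adding, say, a `.csv` handler later does not touch any downstream code.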
Build the ingest pipeline as ingest(doc) → Chunk[]. Each chunk gets: id, doc_id, chunk_index, text, source URL, section path (from markdown headings), source_hash (for change detection), and tenant_id (if multi-tenant). For markdown: use a heading-aware splitter (LangChain's MarkdownHeaderTextSplitter). For HTML: strip nav and footer with BeautifulSoup. For PDFs: extract text with pdfplumber, then apply a recursive splitter. Standardize the output schema; downstream code should never need to know which parser was used.
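One way source_hash can drive change detection is sketched below, assuming the index keeps a mapping from chunk id to the hash stored at the last ingest. The helper names `content_hash` and `plan_reindex` are hypothetical, not part of any library.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text, used as its source_hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(stored: Dict[str, str], fresh: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare stored hashes with freshly ingested chunk text.

    stored: chunk id -> source_hash saved at the previous ingest (assumed layout)
    fresh:  chunk id -> chunk text from the current ingest
    """
    fresh_hashes = {cid: content_hash(text) for cid, text in fresh.items()}
    return {
        # New chunks, or chunks whose text changed, need re-embedding.
        "upsert": sorted(cid for cid, h in fresh_hashes.items() if stored.get(cid) != h),
        # Chunks that no longer exist in the document should be removed.
        "delete": sorted(cid for cid in stored if cid not in fresh_hashes),
    }
```

Chunks whose hash is unchanged appear in neither list, so they can be skipped on re-ingest.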
Use these three prompts in order. Each builds on the one before.
Why does a one-size chunker fail on mixed-format corpora? Give 2 examples.
Walk me through MarkdownHeaderTextSplitter. Why does heading-aware splitting help retrieval?
Design a chunker for HTML docs with tables and code blocks. What gets stripped, what gets preserved as separate chunks?
import hashlib
from dataclasses import dataclass, field

from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter


@dataclass
class Chunk:
    """Standard output schema; identical no matter which parser produced it."""
    id: str
    doc_id: str
    chunk_index: int
    text: str
    source_url: str
    section_path: list[str] = field(default_factory=list)
    source_hash: str = ""


def detect_type(path):
    """Route on file extension (case-insensitive); unknown types fall back to text."""
    p = path.lower()
    if p.endswith(".md"): return "markdown"
    if p.endswith(".html"): return "html"
    if p.endswith(".pdf"): return "pdf"
    return "text"


def parse_text(path, content) -> str:
    """Reduce a document to a string; markdown and plain text pass through."""
    t = detect_type(path)
    if t == "html":
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(content, "lxml")
        # Drop navigation and footer cruft before extracting the text.
        for tag in soup.find_all(["nav", "footer"]):
            tag.decompose()
        return soup.get_text("\n", strip=True)
    if t == "pdf":
        import io
        import pdfplumber
        # pdfplumber reads from a file-like object, so PDF content must be bytes.
        with pdfplumber.open(io.BytesIO(content)) as pdf:
            # Guard against pages that yield no text.
            return "\n\n".join(p.extract_text() or "" for p in pdf.pages)
    return content.decode() if isinstance(content, bytes) else content


def chunker_for(doc_type):
    """Heading-aware splitting for markdown, recursive splitting for everything else."""
    if doc_type == "markdown":
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
        )
    return RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)


def ingest(path, content, base_metadata) -> list[Chunk]:
    doc_type = detect_type(path)
    text = parse_text(path, content)
    splitter = chunker_for(doc_type)
    raw = splitter.split_text(text)
    chunks = []
    for i, c in enumerate(raw):
        # The markdown splitter returns Document objects; the recursive one returns strings.
        chunk_text = c.page_content if hasattr(c, "page_content") else c
        metadata = c.metadata if hasattr(c, "metadata") else {}
        # Keys are h1/h2/h3, so sorting by key orders headings from outermost to innermost.
        section_path = [v for k, v in sorted(metadata.items()) if k.startswith("h")]
        chunks.append(Chunk(
            id=f"{base_metadata['doc_id']}:chunk_{i}",
            doc_id=base_metadata["doc_id"],
            chunk_index=i,
            text=chunk_text,
            source_url=base_metadata["url"],
            section_path=section_path,
            source_hash=hashlib.sha256(chunk_text.encode()).hexdigest(),
        ))
    return chunks

python3 main.py