Capstok — learn by doing

Why this matters

Beginners wildly overestimate the data they need, then never start. For style and format fine-tuning with LoRA, a few hundred to a few thousand high-quality examples often beats tens of thousands of mediocre ones — quality and consistency matter far more than raw count. Knowing realistic data requirements is what turns 'I could never collect a million examples' into 'I can hand-curate 500 great ones this week and start.' This task gives you defensible numbers so you scope a dataset you can actually build, instead of an imaginary one you never will.

Demo

The demo gives ballpark data targets by task type and flags the real risk: noisy data. 200 clean, consistent examples can beat 5,000 noisy ones, because the model learns whatever pattern dominates, including the noise.

Try it yourself

Estimate how many clean examples your specific task needs using the targets table.
Compute 'effective examples' for a dataset you imagine collecting at a realistic consistency fraction.
Argue why 300 hand-checked examples might beat 3,000 scraped ones for a format task.
Identify which target row matches your task and whether it secretly belongs in the RAG row.

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

In one paragraph, roughly how many examples do I need to fine-tune a model for a style or format task?

2. Why it works (the mechanism)

Walk me through why data quality and consistency matter more than raw quantity for LoRA fine-tuning, step by step.

3. Advanced — application & what's next

Given 10,000 scraped examples of varying quality, would you fine-tune on all of them or curate down to 1,000 clean ones, and how would you decide where the cutoff is?

References

Working within free-tier limits. Free and low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Ways to survive it:

Read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter plus a max-retry cap) instead of hammering.
Cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling.
Cache identical requests so retries don't re-spend quota.
Downshift to a smaller, cheaper model for practice runs.
Use the provider Batch API for non-interactive jobs.
Sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
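The back-off advice above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any provider's client: `RateLimited`, `backoff_delay`, and `call_with_retries` are hypothetical names, and it assumes your own request wrapper raises `RateLimited` on a 429 and passes along the Retry-After value in seconds when the server sends one.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error your request wrapper raises on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds from Retry-After, if present


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        # The server told us how long to wait, so honor it.
        return float(retry_after)
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, max_retries=5, sleep=time.sleep, **backoff_kw):
    """Call send(); on RateLimited, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # max-retry cap reached: surface the error
            sleep(backoff_delay(attempt, err.retry_after, **backoff_kw))
```

Passing `sleep` and `rng` as parameters is only a convenience so the retry schedule can be checked without actually waiting. A concurrency limiter and a request cache would sit around `send` and are left out here.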

# Rough, practical data targets for LoRA SFT (NOT pretraining-scale).
TARGETS = {
    "format/style change":      "200 - 1,000 clean examples",
    "narrow task (classify/extract)": "500 - 2,000 examples",
    "general assistant behavior": "2,000 - 10,000 examples",
    "new domain knowledge":     "fine-tuning is the wrong tool -> use RAG",
}
for task, n in TARGETS.items():
    print(f"{task:32s} {n}")

# The trap: more data with inconsistent labels HURTS. The model fits the
# dominant pattern -- if 20% of your examples are sloppy, it learns sloppiness.
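def usable(n_examples, consistency_fraction):
    # round(), not int(): int() truncates, so float error
    # (e.g. 100 * 0.29) can undercount by one.
    effective = round(n_examples * consistency_fraction)
    return f"{effective} effective examples (rest add noise)"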
print(usable(5000, 0.6))   # 5000 messy examples ~ 3000 useful ones

Run: python3 main.py
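
To make the clean-versus-noisy comparison concrete, here is a small sketch that reports how many noisy examples a dataset carries as well as how many usable ones. The helper `noise_report` is hypothetical, and the consistency fractions (0.95 for a hand-checked set, 0.60 for a scraped one) are illustrative assumptions, not measurements.

```python
def noise_report(n_examples, consistency_fraction):
    """Split a dataset into consistent examples and noisy ones."""
    clean = round(n_examples * consistency_fraction)
    noisy = n_examples - clean
    return {"clean": clean, "noisy": noisy, "noise_share": noisy / n_examples}


# Assumed fractions, for illustration only.
hand_checked = noise_report(300, 0.95)
scraped = noise_report(3000, 0.60)

for name, report in [("hand-checked", hand_checked), ("scraped", scraped)]:
    print(f"{name:13s} clean={report['clean']:5d} "
          f"noisy={report['noisy']:5d} noise share={report['noise_share']:.0%}")
```

Under these assumptions the scraped set has more clean examples in absolute terms, but it also pushes over a thousand inconsistent ones into training. That is the trade-off to weigh for a format task. The numbers alone do not settle which dataset trains the better model.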

How much data do you actually need?