Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Beginners wildly overestimate the data they need, then never start. For style and format fine-tuning with LoRA, a few hundred to a few thousand high-quality examples often beats tens of thousands of mediocre ones — quality and consistency matter far more than raw count. Knowing realistic data requirements is what turns 'I could never collect a million examples' into 'I can hand-curate 500 great ones this week and start.' This task gives you defensible numbers so you scope a dataset you can actually build, instead of an imaginary one you never will.
The demo gives ballpark data targets by task type and flags the real risk — that 200 clean, consistent examples beat 5,000 noisy ones, because the model learns whatever pattern dominates, including the noise.
Use these three in order. Each builds on the one before.
In one paragraph, roughly how many examples do I need to fine-tune a model for a style or format task?
Walk me through why data quality and consistency matter more than raw quantity for LoRA fine-tuning, step by step.
Given 10,000 scraped examples of varying quality, would you fine-tune on all of them or curate down to 1,000 clean ones, and how would you decide where the cutoff is?
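One way to approach the curation question in the last prompt, as a minimal sketch rather than a prescribed method: deduplicate, drop anything below a quality threshold, and keep the best of what remains. The `curate` function, the `prompt` / `response` / `score` keys, and the 0.6 cutoff are all assumptions made for illustration; the score stands in for whatever rubric or hand rating you apply.

```python
def curate(examples, min_score, max_keep):
    """Deduplicate, drop low scorers, and keep the best `max_keep` examples.

    Each example is a dict with hypothetical keys "prompt", "response",
    and "score" (a 0-1 quality rating you assign by hand or with a rubric).
    """
    seen = set()
    unique = []
    for ex in examples:
        # Treat examples as duplicates if they match after trimming and lowercasing.
        key = (ex["prompt"].strip().lower(), ex["response"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # Keep only examples at or above the quality threshold, best first.
    good = [ex for ex in unique if ex["score"] >= min_score]
    good.sort(key=lambda ex: ex["score"], reverse=True)
    return good[:max_keep]


raw = [
    {"prompt": "Summarise A", "response": "Short summary.", "score": 0.9},
    {"prompt": "summarise a ", "response": "short summary.", "score": 0.9},  # duplicate
    {"prompt": "Summarise B", "response": "lol idk", "score": 0.2},          # sloppy
    {"prompt": "Summarise C", "response": "Another summary.", "score": 0.7},
]
kept = curate(raw, min_score=0.6, max_keep=1000)
print(len(kept), "of", len(raw), "examples kept")
```

Raising `min_score` until the kept set looks uniformly clean on a manual spot check is one reasonable way to find the cutoff; the count that survives matters less than whether the survivors are consistent.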
Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter + a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
# Rough, practical data targets for LoRA SFT (NOT pretraining-scale).
TARGETS = {
    "format/style change": "200 - 1,000 clean examples",
    "narrow task (classify/extract)": "500 - 2,000 examples",
    "general assistant behavior": "2,000 - 10,000 examples",
    "new domain knowledge": "fine-tuning is the wrong tool -> use RAG",
}
for task, n in TARGETS.items():
    print(f"{task:32s} {n}")

# The trap: more data with inconsistent labels HURTS. The model fits the
# dominant pattern -- if 20% of your examples are sloppy, it learns sloppiness.
def usable(n_examples, consistency_fraction):
    return f"{int(n_examples*consistency_fraction)} effective examples (rest add noise)"

print(usable(5000, 0.6))  # 5000 messy examples ~ 3000 useful ones

python3 main.py