Updated: 2026-01-31 By: Ari Heljakka
Short answer
Dataset size affects LLM fine-tuning performance along a power-law curve with diminishing returns. The right size depends on the task: simple classification often works with a few hundred examples; structured extraction wants a few hundred to about a thousand; content generation and domain adaptation typically need several hundred to a few thousand; translation and summarization can need thousands to hundreds of thousands. Quality, task alignment, and a calibrated evaluation suite matter more than raw count. Small, curated datasets often outperform larger noisy ones. Every training run should be gated against a versioned ground-truth evaluation suite that catches both target-task improvements and source-task regressions.
Key facts
- Definition: The relationship between training-set size and post-fine-tuning model performance, governed by power-law scaling, task alignment, and data quality.
- When to use: Any fine-tuning decision: deciding whether to collect more data, whether the current dataset is enough, whether the diminishing-returns inflection has been hit.
- Limitations: Scaling laws give shape, not absolute targets. The right size for a specific task can only be confirmed by a calibrated evaluation curve.
- Example: A team gets to 90 percent classification accuracy on 150 labeled examples; doubling the dataset moves accuracy to 92 percent; the team reallocates budget to label edge cases instead of collecting more typical ones.
Key takeaways
- Size is a coarse lever. Power-law scaling means each 10x increase yields progressively smaller returns.
- Task type sets the baseline. Classification needs hundreds; translation can need hundreds of thousands.
- Data quality dominates raw count. Curated, labeled, deduplicated, task-aligned data beats larger noisy data.
- Parameter-efficient methods (LoRA, QLoRA, adapters) make small datasets effective and keep forgetting risk bounded.
- The evaluation suite is the gate. Without per-dimension scoring against a versioned ground-truth dataset, dataset-size decisions are guesses.
Definition
Dataset-size impact on fine-tuning is the measurable effect of training-set size on a model's post-fine-tuning performance, controlled for base model, training method, and evaluation criteria.
Three properties recur.
- Power-law scaling. When base model capacity and training quality are not the bottleneck, performance improves along a power-law curve in dataset size. Every 10x increase in dataset size delivers a smaller absolute gain than the previous one.
- Diminishing returns inflection. Most tasks have a band where further data yields a fraction of a point of improvement. Past the inflection, budget is better spent on data quality, edge-case coverage, or evaluation expansion.
- Forgetting risk scales with adaptation depth, not dataset size. A small dataset with full fine-tuning can cause more forgetting than a large dataset with LoRA.
The output of a sizing decision is not a target number; it is a calibrated evaluation curve that shows the per-dimension trade-off between additional data and observed performance gain.
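The power-law property above can be written down in one common illustrative form. This is a sketch of the curve's shape, not a fit from this article: $E$ (error on one evaluation dimension), $D$ (number of training examples), $E_{\infty}$ (error that more data cannot remove), and the constants $a$ and $b$ are all assumed symbols.

```latex
E(D) = E_{\infty} + a\,D^{-b}, \qquad a > 0,\ b > 0

E(D) - E(10D) = a\,D^{-b}\left(1 - 10^{-b}\right)
```

Each tenfold increase multiplies the reducible error $E(D) - E_{\infty}$ by the same factor $10^{-b}$, but the absolute gain $E(D) - E(10D)$ shrinks as $D$ grows. With a hypothetical $b = 0.3$, every 10x step removes about half of whatever reducible error remains, since $10^{-0.3} \approx 0.50$.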
When this matters
Sizing decisions are deliberate engineering concerns when at least one of the following holds.
- Data collection is expensive. Labeled medical, legal, or other expert-annotated data can cost hundreds to thousands of dollars per example, so sizing decisions translate directly to budget.
- Compute budget is bounded. Larger datasets take longer to train and longer to evaluate. The inflection point matters.
- The task is high-stakes. Healthcare, legal, financial, and consumer-visible outputs cannot ship on under-trained models; sizing must clear the calibrated quality bar.
- The base model is small. Smaller base models hit the inflection later than larger ones, so they need larger data budgets.
- Continued fine-tuning is on the roadmap. Each cycle adds data; the curve helps decide when to stop adding and start curating.
How it works
A defensible dataset-sizing workflow has five components.
Component 1, Pick the right size band by task type
Start from the task. Common bands:
| Task type | Typical range |
|---|---|
| Simple classification | 100 to 300 examples |
| Structured extraction | 200 to 1,000 examples |
| Content generation | 500 to 2,000 examples |
| Domain adaptation | 1,000 to 5,000 examples |
| Translation, summarization | 5,000 to 200,000+ examples |
The bands are starting points, not targets. The evaluation curve confirms or revises them.
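The table can be kept next to the training code as a small lookup so that a proposed dataset size is checked against its band before labeling starts. This is a minimal sketch: the key names and the `position_in_band` helper are hypothetical, and the numbers are copied from the table above.

```python
# Starting bands from the table: (low, high, open_ended).
# open_ended=True marks the "200,000+" upper bound, which has no fixed ceiling.
SIZE_BANDS = {
    "simple_classification": (100, 300, False),
    "structured_extraction": (200, 1_000, False),
    "content_generation": (500, 2_000, False),
    "domain_adaptation": (1_000, 5_000, False),
    "translation_summarization": (5_000, 200_000, True),
}


def position_in_band(task_type: str, n_examples: int) -> str:
    """Return 'below', 'within', or 'above' relative to the starting band."""
    low, high, open_ended = SIZE_BANDS[task_type]
    if n_examples < low:
        return "below"
    if n_examples > high and not open_ended:
        return "above"
    return "within"
```

A result of "above" is a prompt to check the evaluation curve for a plateau, not proof that the dataset is too large.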
Component 2, Build a versioned ground-truth evaluation suite first
Before collecting training data, build the evaluation suite. It has three layers.
- Deterministic checks for format, length, schema, banned phrases.
- Calibrated LLM judges for semantic dimensions (correctness, faithfulness, tone), each tied to a human-labeled probe with documented agreement.
- A source-task regression probe to catch catastrophic forgetting: a held-out sample of tasks that exercise the base model's general capability.
Without the evaluation suite, every sizing decision is a guess. With it, every increment of additional data scores on the same scorecard.
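The deterministic layer is the cheapest of the three to build. The sketch below assumes, purely for illustration, that the model is asked to emit a JSON object; the function name, its parameters, and the result keys are hypothetical rather than part of any particular evaluation framework.

```python
import json


def deterministic_checks(output: str, *, max_chars: int,
                         required_keys: set[str],
                         banned_phrases: list[str]) -> dict[str, bool]:
    """Run the non-LLM checks on one model output; True means the check passed."""
    results = {"format": False, "schema": False}
    try:
        parsed = json.loads(output)
        results["format"] = True  # output is valid JSON
        # Schema check: a JSON object that contains every required key.
        results["schema"] = (
            isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
        )
    except json.JSONDecodeError:
        pass  # format and schema stay False
    results["length"] = len(output) <= max_chars
    lowered = output.lower()
    results["banned_phrases"] = not any(
        phrase.lower() in lowered for phrase in banned_phrases
    )
    return results
```

Each key maps to one pass rate on the scorecard, so a format regression is visible separately from a semantic one.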
Component 3, Train in increments and plot the curve
Do not train once on the full dataset. Train in increments (often 25 percent, 50 percent, 75 percent, 100 percent) and score each on the evaluation suite. The resulting curve answers:
- Where is the inflection point on each dimension?
- Which dimensions still benefit from more data?
- Which dimensions are noise-bound (additional data does not move the score)?
- Is the source-task regression suite still passing at every increment?
The curve, not the final number, is the basis for the next sizing decision.
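One way to build the increments is as nested subsets, so that each larger run contains the previous one and score differences reflect added data rather than a different sample. This is a sketch under that assumption; `nested_increments` is a hypothetical helper, and training and scoring are left out.

```python
import random


def nested_increments(examples, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return one training subset per fraction; each subset contains the previous one."""
    shuffled = list(examples)
    # A fixed seed keeps the increments reproducible across reruns.
    random.Random(seed).shuffle(shuffled)
    subsets = []
    for fraction in fractions:
        cutoff = max(1, round(len(shuffled) * fraction))
        subsets.append(shuffled[:cutoff])  # a prefix of one shuffle, hence nested
    return subsets
```

Each subset would then be used for one training run and scored on the full evaluation suite, giving one point per dimension on the curve.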
Component 4, Reallocate budget to data quality past the inflection
Past the inflection point, additional typical examples produce diminishing returns. Reallocation options:
- Edge cases. Label examples that fail the current model; targeted labels move the curve more than typical examples.
- Deduplication. Remove near-duplicates that bias the loss without adding signal.
- Reweighting. Upweight underrepresented slices; sample uniformly across slices for evaluation.
- Human-in-the-loop refinement. Have reviewers relabel the examples on which existing annotations disagree, replacing noisy labels with cleaner ones.
Data quality work is a measurable activity. Score the post-quality dataset against the same evaluation suite; if it lifts the curve, the quality work paid off.
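As an illustration of the deduplication step, the sketch below drops near-duplicates using Jaccard similarity over word trigrams. The technique, the 0.8 threshold, and the shingle size are assumptions chosen for the example, not values from the article. The pairwise comparison is quadratic, which is acceptable at the dataset sizes discussed here.

```python
import re


def _shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; short texts collapse to a single shingle."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def drop_near_duplicates(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each group of near-duplicate texts."""
    kept: list[str] = []
    kept_shingles: list[set[tuple[str, ...]]] = []
    for text in texts:
        current = _shingles(text)
        is_duplicate = any(
            len(current & seen) / len(current | seen) >= threshold
            for seen in kept_shingles
        )
        if not is_duplicate:
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

Retraining on the deduplicated set and rescoring it on the same suite shows whether the cleanup actually lifted the curve.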
Component 5, Match the training method to the dataset size
- Full fine-tuning updates every parameter. Highest forgetting risk; rarely needed when LoRA or adapters work.
- LoRA and QLoRA typically train fewer than 1 percent of the model's parameters; effective with small datasets, low forgetting risk, easy to roll back.
- Adapters insert small trainable modules inside the transformer layers; slightly larger than LoRA, useful for multi-task transfer.
- Few-shot prompting is the zero-training-cost baseline; sometimes the right answer is no fine-tuning at all.
- RAG sidesteps weight changes entirely; ideal when the knowledge base changes faster than weights can be retrained.
The matching matters. A 500-example dataset with full fine-tuning often regresses source tasks; the same dataset with LoRA usually does not.
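The "fewer than 1 percent" figure for LoRA can be sanity-checked with simple arithmetic: a rank-r update to a weight matrix of shape d_out x d_in adds r * (d_in + d_out) trainable parameters. The model dimensions below are illustrative assumptions for a 7B-class model (hidden size 4096, 32 layers, two adapted square projections per layer), not figures from the article.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors per adapted matrix: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, illustrative dimensions for a 7B-class model.
HIDDEN = 4096               # width of each adapted projection matrix
LAYERS = 32                 # number of transformer layers
ADAPTED_PER_LAYER = 2       # e.g. two attention projections per layer
BASE_PARAMS = 7_000_000_000
RANK = 16

trainable = LAYERS * ADAPTED_PER_LAYER * lora_trainable_params(HIDDEN, HIDDEN, RANK)
fraction = trainable / BASE_PARAMS
print(f"{trainable:,} trainable parameters, {fraction:.2%} of the base model")
```

Under these assumptions the adapter holds about 8.4 million trainable parameters, roughly 0.12 percent of the base model. Adapting more matrices or raising the rank increases the fraction proportionally.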
Example
A team builds a five-category customer-support classifier on a 7B base model.
Evaluation suite first. Four dimensions: classification accuracy, refusal correctness on out-of-scope inputs, tone, and a source-task regression probe (200 examples drawn from general benchmarks). Refusal correctness and tone are scored by calibrated LLM judges; classification accuracy and the source-task probe use deterministic comparisons.
Increment one: 50 examples, LoRA rank 16. Classification accuracy 0.78, refusal 0.72, tone 0.81, source-task probe 0.94 (no regression). Below quality bar.
Increment two: 100 examples. Accuracy 0.86, refusal 0.79, tone 0.84, source-task probe 0.94. Approaching bar.
Increment three: 150 examples. Accuracy 0.90, refusal 0.83, tone 0.86, source-task probe 0.93. Above bar on accuracy and tone; refusal still below threshold.
Increment four: 300 examples. Accuracy 0.92, refusal 0.87, tone 0.87, source-task probe 0.92. All dimensions above bar.
Increment five: 600 examples. Accuracy 0.93, refusal 0.88, tone 0.87, source-task probe 0.91. Diminishing returns clear; source-task probe still acceptable.
Reallocation. The team stops expanding the typical dataset and keeps the 300-example set from increment four. Budget is reallocated to label 80 edge cases (ambiguous routing decisions, multilingual inputs, escalation triggers). Post-reallocation: accuracy 0.94, refusal 0.91, tone 0.87. The 80 targeted labels moved refusal by 0.04, whereas the 300 additional typical examples in increment five moved it by only 0.01.
Outcome. The team's final dataset is 380 examples: the 300 typical examples plus the 80 edge cases. The evaluation suite is the artifact that justified stopping at that size; the per-dimension curve documented the trade-off. The training method (LoRA rank 16) kept forgetting risk bounded.
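The diminishing returns in this example can be made explicit by normalizing each step to gain per 100 added examples. The scores below are copied from the five increments; the `gain_per_100_examples` helper and the dictionary keys are hypothetical names used for this sketch.

```python
# (dataset size, scores) for each increment in the example.
CURVE = [
    (50,  {"accuracy": 0.78, "refusal": 0.72, "tone": 0.81, "source_task": 0.94}),
    (100, {"accuracy": 0.86, "refusal": 0.79, "tone": 0.84, "source_task": 0.94}),
    (150, {"accuracy": 0.90, "refusal": 0.83, "tone": 0.86, "source_task": 0.93}),
    (300, {"accuracy": 0.92, "refusal": 0.87, "tone": 0.87, "source_task": 0.92}),
    (600, {"accuracy": 0.93, "refusal": 0.88, "tone": 0.87, "source_task": 0.91}),
]


def gain_per_100_examples(curve, dimension):
    """Score change per 100 added examples between consecutive increments."""
    gains = []
    for (n_prev, prev), (n_next, nxt) in zip(curve, curve[1:]):
        delta = nxt[dimension] - prev[dimension]
        gains.append((n_next, round(delta * 100 / (n_next - n_prev), 4)))
    return gains


for size, gain in gain_per_100_examples(CURVE, "refusal"):
    print(f"up to {size} examples: {gain:+.4f} refusal per 100 added")
```

For refusal correctness the per-100 gain falls from about 0.14 at the first step to about 0.003 at the last, which is the flattening that justified stopping. Running the same function on `source_task` gives zero or negative values at every step, making the slow drift in the regression probe visible as well.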
Limitations
- Scaling laws are shapes, not targets. Power-law fits give shape; the actual inflection on your task can only be found empirically.
- Quality is hard to measure cheaply. Inter-rater agreement, label noise, and edge-case coverage all matter and are not trivially summarized.
- Catastrophic forgetting is bounded, not eliminated. Every adaptation costs some general capability; track it explicitly with a source-task regression probe.
- Evaluation noise can hide diminishing returns. Small score movements between increments may be evaluator variance, not real signal. Run each increment at least twice, with different random seeds, and report the variance.
- Synthetic data has its own risks. Self-training, pseudo-labeling, and synthetic generation can amplify errors without strict confidence thresholds and a clean human-labeled validation set.
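The evaluation-noise point can be turned into a simple guard: treat a gain between two increments as real only when it clearly exceeds the run-to-run spread. The sketch below is a rough heuristic, not a formal significance test; the function name and the factor `k = 2.0` are assumptions.

```python
from statistics import mean, stdev


def exceeds_noise(runs_before, runs_after, k=2.0):
    """True when the mean gain is larger than k times the larger run-to-run spread.

    Each argument is a list of scores from repeated runs of one increment
    (at least two runs each, because stdev needs two data points).
    """
    gain = mean(runs_after) - mean(runs_before)
    spread = max(stdev(runs_before), stdev(runs_after))
    return gain > k * spread
```

With hypothetical scores, `exceeds_noise([0.870, 0.874], [0.878, 0.882])` treats the gain as real, while `exceeds_noise([0.87, 0.88], [0.875, 0.885])` does not, because the runs disagree with each other by more than the increments differ.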
Evidence and sources
- Scaling Laws for Neural Language Models (Kaplan et al., 2020), arxiv.org/abs/2001.08361
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021), arxiv.org/abs/2106.09685
- Training Compute-Optimal Large Language Models (Hoffmann et al., 2022; the Chinchilla paper), arxiv.org/abs/2203.15556
FAQ
How small can my fine-tuning dataset be? For simple classification, 100 to 300 examples is often enough. For structured extraction, 200 to 1,000. For generative tasks, 500 to several thousand. The right minimum is whatever clears the calibrated quality bar; only the evaluation suite can answer this.
When should I stop collecting more data? When the evaluation curve flattens on the dimensions you care about and the source-task regression probe is still passing. Past that point, budget is better spent on edge cases, deduplication, or quality.
Does quality really beat quantity? Yes, when the quality difference is meaningful. A deduplicated, task-aligned, expert-labeled set of 500 examples often beats 5,000 noisy crowd-labeled ones.
LoRA or full fine-tuning? Start with LoRA. Move to full fine-tuning only when LoRA plateaus below the quality bar and the forgetting budget can absorb the cost.
Can I use synthetic data? Yes, with discipline. Pseudo-labeling requires strict confidence thresholds (typically above 0.95) and a clean human-labeled validation set. Treat synthetic data as a multiplier on existing labels, not a substitute for them.
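A minimal sketch of the pseudo-labeling discipline described in the last answer: keep only predictions above the confidence threshold, and measure the labeler against the human-labeled validation set before trusting its output. Both function names and the tuple layout are hypothetical; the 0.95 default mirrors the threshold mentioned above.

```python
def select_pseudo_labels(predictions, threshold=0.95):
    """Keep (text, label) pairs whose confidence is strictly above the threshold.

    `predictions` is an iterable of (text, label, confidence) tuples.
    """
    return [(text, label) for text, label, conf in predictions if conf > threshold]


def labeler_accuracy(predicted_labels, human_labels):
    """Share of the human-labeled validation set that the pseudo-labeler gets right."""
    if len(predicted_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("validation set is empty")
    matches = sum(p == h for p, h in zip(predicted_labels, human_labels))
    return matches / len(human_labels)
```

If the labeler's accuracy on the clean validation set is below the quality bar, raising the confidence threshold alone does not make its labels safe to train on.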