Chinchilla Scaling Laws: Three Methods and Why Labs Ignore Them

The conventional wisdom in AI research held that bigger models were better models. When OpenAI released GPT-3 in 2020 with 175 billion parameters, the field’s implicit assumption was that scale in parameters was the primary lever for capability. More parameters, better performance.

In March 2022, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and colleagues at DeepMind published a paper that dismantled this assumption. They trained a 70 billion parameter model called Chinchilla on four times more data than the 280 billion parameter Gopher, using the same computational budget for both. Chinchilla outperformed Gopher on nearly every benchmark. It also outperformed GPT-3 (175B parameters), Jurassic-1 (178B), and Megatron-Turing NLG (530B).

The field had been training its models wrong. Not slightly wrong. Fundamentally wrong, for years.

What the Chinchilla paper established, and what is routinely mischaracterized in summaries, is not a simple rule about a 20:1 token-to-parameter ratio. It is a set of empirical formulas, derived through three independent methods with some disagreement between them, for how to allocate a fixed compute budget between model size and training data to minimize loss. The details of what the paper actually shows, and how practitioners have deliberately violated its recommendations for economic reasons, matter for anyone making decisions about model training.

The Kaplan Scaling Laws: What Chinchilla Overturned

To understand what Chinchilla changed, it helps to understand what it changed from. In 2020, Jared Kaplan, Sam McCandlish, Tom Henighan, and colleagues at OpenAI published “Scaling Laws for Neural Language Models,” establishing that language model performance (measured as cross-entropy loss) follows power laws with model size, dataset size, and compute budget.

The Kaplan paper’s recommendation for fixed-compute training was to scale model size faster than dataset size. As compute budget grows, you should increase parameters more aggressively than tokens. The practical implication was to train very large models on relatively small amounts of data. GPT-3 (175B parameters, approximately 300 billion training tokens) followed this prescription.

The Kaplan paper fitted its scaling laws on models up to roughly 1.5 billion parameters and extrapolated upward. The DeepMind researchers in 2022 ran a more systematic series of experiments specifically designed to find the optimal model-size-to-data ratio at each compute budget, using models between 70 million and 16 billion parameters across over 400 training runs. What they found contradicted Kaplan’s extrapolations.

Three Methods, One Conclusion (With Some Disagreement)

The Chinchilla paper is notable for using three independent approaches to estimate the optimal model size and token count for a given compute budget. All three converge on the conclusion that model size and training data should scale together. The details diverge in ways that matter practically.

Method 1 (Fixing model size, varying tokens): The researchers trained a fixed family of model sizes, each for several different training token budgets. For each compute (FLOP) level, they then identified which run reached the lowest loss, giving a mapping from compute to the best model size and token count. This approach suggests that optimal token count scales approximately linearly with parameter count.

Method 2 (IsoFLOP profiles): The researchers held total compute budget constant and varied the allocation between model size and tokens. For each compute budget, they found the model size that minimized final loss. This is the most direct answer to the practical question of compute-optimal training, and it also suggests near-linear scaling of parameters and tokens together.
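The IsoFLOP procedure can be sketched in a few lines. This is a minimal illustration, not the paper's code: real IsoFLOP curves come from actually training models, whereas `simulated_final_loss` below is a hypothetical stand-in that uses the parametric loss form with the constants reported for the paper's third approach (those constants do not appear in this article). The function names, the sweep width, and the choice to center each sweep on a 20 tokens-per-parameter guess are all assumptions.

```python
import numpy as np

# Stand-in for real training runs (an assumption for illustration only):
# parametric loss with the constants reported in the Chinchilla paper.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def simulated_final_loss(n_params, n_tokens):
    """Pretend 'training run': loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_optimum(compute_flops, model_sizes):
    """Fit a parabola to loss vs log10(N) at fixed compute; return its minimum."""
    tokens = compute_flops / (6.0 * model_sizes)  # C = 6ND  =>  D = C / (6N)
    losses = simulated_final_loss(model_sizes, tokens)
    curvature, slope, _ = np.polyfit(np.log10(model_sizes), losses, deg=2)
    return 10 ** (-slope / (2.0 * curvature))


budgets = np.array([1e19, 1e20, 1e21, 1e22])  # hypothetical FLOP budgets
optimal_sizes = []
for budget in budgets:
    center = np.sqrt(budget / 120.0)            # rough guess: D = 20N, C = 6ND
    sweep = center * np.logspace(-0.5, 0.5, 9)  # nine sizes around the guess
    optimal_sizes.append(isoflop_optimum(budget, sweep))
optimal_sizes = np.array(optimal_sizes)
optimal_tokens = budgets / (6.0 * optimal_sizes)

# Slopes in log-log space are the scaling exponents: N_opt ~ C^a, D_opt ~ C^b.
size_exponent, _ = np.polyfit(np.log10(budgets), np.log10(optimal_sizes), deg=1)
token_exponent, _ = np.polyfit(np.log10(budgets), np.log10(optimal_tokens), deg=1)
print(f"N_opt ~ C^{size_exponent:.2f}, D_opt ~ C^{token_exponent:.2f}")
```

Because the losses here come from a formula rather than from training, the fitted exponents only echo the constants that were plugged in. With real runs, the same two fitting steps (a parabola per budget, then a line in log-log space) are where the empirical exponents come from.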

Method 3 (Fitting parametric loss models): The researchers fitted an explicit formula L(N, D) = E + A/N^alpha + B/D^beta to the training loss as a function of parameters N and data D, where E is the irreducible loss (Bayes entropy of the data), and alpha and beta are fitted exponents. This formula allows computing the optimal parameter count as a function of compute budget analytically.
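The analytic optimum mentioned here follows from minimizing the formula subject to the compute constraint C = 6ND, the usual training FLOP approximation. The sketch below implements that closed form. The function name is hypothetical, and the constants passed in the example are the fitted values reported in the Chinchilla paper, which this article does not state; treat them as inputs to be re-fitted for your own data.

```python
def compute_optimal_allocation(compute_flops, A, B, alpha, beta):
    """Minimise E + A/N**alpha + B/D**beta subject to C = 6*N*D.

    Setting the derivative with respect to N to zero gives
    N_opt = G * (C/6)**a and D_opt = (C/6)**b / G, with the terms below.
    E drops out because it does not depend on N or D.
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)   # exponent for parameters
    b = alpha / (alpha + beta)  # exponent for tokens
    n_opt = G * (compute_flops / 6.0) ** a
    d_opt = (compute_flops / 6.0) ** b / G
    return n_opt, d_opt


# Example budget: the compute of a 70B model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal_allocation(
    budget, A=406.4, B=410.7, alpha=0.34, beta=0.28
)
print(f"N_opt = {n_opt:.3g} params, D_opt = {d_opt:.3g} tokens, "
      f"ratio = {d_opt / n_opt:.0f} tokens/param")
```

Whenever the fitted beta is smaller than alpha, the parameter exponent beta/(alpha+beta) falls below 0.5, so this method allocates a growing share of extra compute to data rather than to parameters.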

Methods 1 and 2 produce similar predictions. Method 3 produces a somewhat different prediction. Specifically, it suggests that the optimal parameter count should grow more slowly with compute, meaning models should be trained on even more data relative to parameters than Methods 1 and 2 suggest. This disagreement between the methods is rarely discussed in summaries of the paper, but it means the “20:1” ratio from Chinchilla is a rule of thumb summarizing estimates that differ by method, not a single precisely determined constant.

What the 20:1 Rule Actually Says

The commonly cited rule is that optimal training uses approximately 20 training tokens per model parameter. A 70B model should be trained on approximately 1.4 trillion tokens. A 7B model on approximately 140 billion tokens.

The Chinchilla 70B model was trained on approximately 1.4 trillion tokens, which matches this ratio. GPT-3 (175B parameters, 300 billion tokens) used a ratio of roughly 1.7:1, massively undertrained by Chinchilla’s analysis. This is why Chinchilla, with a quarter of Gopher’s parameters and the same training compute, outperformed both Gopher and the larger GPT-3: both models were dramatically undertrained for their parameter counts.

But the 20:1 rule applies specifically to training compute optimization, meaning the model size and data allocation that minimizes loss for a given number of FLOPs. It says nothing about what matters for inference deployment, how to balance training cost against serving cost, or what happens when you train far beyond the Chinchilla-optimal point.

Llama Violated Chinchilla on Purpose

Meta’s LLaMA models, released in 2023, were deliberately trained beyond Chinchilla-optimal data quantities. LLaMA-7B was trained on approximately 1 trillion tokens, roughly 7x more than Chinchilla’s training-optimal recommendation for a 7B model. LLaMA-13B was also trained on approximately 1 trillion tokens, far more than the Chinchilla-optimal 260 billion; only the larger 33B and 65B models used approximately 1.4 trillion.
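The arithmetic behind these comparisons is simple enough to check directly. The snippet below is a small sketch using only the parameter and token counts quoted in this article; the helper names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # rule of thumb quoted in the text

# (parameters, training tokens), as quoted in this article
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1.0e12),
}


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def overtraining_factor(n_params, n_tokens, rule=CHINCHILLA_TOKENS_PER_PARAM):
    """How many times more data than the 20:1 rule of thumb recommends."""
    return n_tokens / (rule * n_params)


for name, (n_params, n_tokens) in models.items():
    ratio = tokens_per_param(n_params, n_tokens)
    factor = overtraining_factor(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens/param, {factor:.2f}x the 20:1 rule")
```

A factor below 1 marks a model as undertrained by the rule of thumb, and a factor above 1 marks it as trained past the training-compute optimum.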

Hugo Touvron, Thibaut Lavril, and colleagues at Meta explained the reasoning explicitly in the LLaMA paper. The Chinchilla-optimal model is the one that achieves a given performance level with the minimum training compute. But that is not the same as the best model for a given inference budget. If you plan to run a model billions of times after training, the inference cost of a larger model accumulates. A smaller model trained on more data achieves the same performance level as a larger Chinchilla-optimal model, at a fraction of the per-query inference compute.

The economic incentive to violate Chinchilla’s training-optimal recommendation is enormous when inference is the dominant cost. Meta was releasing LLaMA for researchers to run on their own hardware. Minimizing parameter count while maintaining capability was more valuable to that use case than minimizing training FLOPs. Smaller inference-optimal models are also more amenable to parameter-efficient fine-tuning with LoRA, since lower-rank weight updates suffice for well-pre-trained models. The same logic applies to any production deployment serving many requests.

Beyond Chinchilla: Inference-Adjusted Optimal

Nikhil Sardana and Jonathan Frankle at MosaicML (now Databricks) formalized this reasoning in “Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws” (arXiv:2401.00448, 2024).

The paper modifies Chinchilla’s optimization to include inference cost. If you expect to serve a model for R total inference requests after training, the total compute cost is training FLOPs plus R times per-request inference FLOPs. For large R (millions to billions of requests), the optimal parameter count is smaller than Chinchilla-training-optimal, because a smaller model that required more training data to reach the same capability pays less inference cost per request. Techniques like speculative decoding reduce per-request latency further, but they address serving overhead rather than the fundamental model size tradeoff.
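The trade-off described here can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it measures inference demand in total tokens rather than requests, assumes about 6ND FLOPs for training and about 2N FLOPs per inference token (common approximations), and reuses the parametric loss form with the constants reported in the Chinchilla paper. The target loss of 1.94 and the function names are hypothetical choices.

```python
import numpy as np

# Assumed loss model: L(N, D) = E + A/N^alpha + B/D^beta with the constants
# reported in the Chinchilla paper (re-fit these for your own data).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28
CANDIDATE_SIZES = np.logspace(9, 12, 2000)  # 1B to 1T parameters


def tokens_needed(n_params, target_loss):
    """Training tokens needed to hit target_loss at size N (inf if unreachable)."""
    gap = target_loss - E - A / n_params**ALPHA
    needed = (B / np.maximum(gap, 1e-12)) ** (1.0 / BETA)
    return np.where(gap > 0, needed, np.inf)


def total_flops(n_params, target_loss, inference_tokens):
    """Training cost (about 6ND) plus lifetime inference cost (about 2N per token)."""
    d_train = tokens_needed(n_params, target_loss)
    return 6.0 * n_params * d_train + 2.0 * n_params * inference_tokens


def best_model_size(target_loss, inference_tokens):
    """Grid search for the size that minimises total lifetime compute."""
    costs = total_flops(CANDIDATE_SIZES, target_loss, inference_tokens)
    return CANDIDATE_SIZES[np.argmin(costs)]


for demand in (0.0, 1e13, 1e14):  # lifetime inference tokens (hypothetical)
    size = best_model_size(target_loss=1.94, inference_tokens=demand)
    print(f"inference tokens = {demand:.0e}: best N = {size:.3g}")
```

With zero inference demand the search recovers the training-compute optimum for the target loss. As demand grows, the best size shrinks and the required training tokens rise, which is the qualitative behaviour the paper formalizes.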

Sardana and Frankle found that for models expecting approximately one billion inference requests, the optimal parameter count is substantially smaller than the Chinchilla-training-optimal model at the same capability level, and models should be trained on far more data. They validated their formula against 47 models of varying sizes and found that model quality continues to improve as tokens-per-parameter ratios extend to 10,000:1, orders of magnitude beyond Chinchilla’s 20:1.

This is the theoretical foundation for why Mistral AI, Google, and others now train models that appear significantly overtrained by Chinchilla’s original standard. They are optimizing for total deployment cost, not training cost alone.

The Data Wall and the 2025-2026 Explosion in Token Counts

The beyond-Chinchilla direction requires more data. Substantially more. If Chinchilla suggested 20 tokens per parameter was optimal for training compute, and inference-adjusted optimal pushes toward hundreds or thousands of tokens per parameter, then the field needs vastly larger training datasets.

The response has been to extend training far beyond what was previously considered necessary. Alibaba’s Qwen3-0.6B model, released in April 2025, was trained on 36 trillion tokens, a tokens-to-parameters ratio of 60,000:1. Liquid AI’s LFM2.5-350M, released in April 2026, achieved a ratio of 80,000:1. These ratios are not errors or experimental outliers. They reflect deliberate choices to train small models on very long data runs to extract maximum inference-time efficiency from compact architectures.

The trend has a name: inference-optimal training. Its economic logic is straightforward. Training is a one-time cost. Inference is an ongoing cost. As models get deployed at scale, the ratio of inference cost to training cost grows toward infinity. At the limit, the training budget is a rounding error and the only thing that matters is per-query compute, which is determined by model size. Chinchilla-optimal is the right answer only if you plan to train a model and never run it.

What the Loss Formula Predicts and Where It Fails

The parametric loss model L(N, D) = E + A/N^alpha + B/D^beta from Method 3 of the Chinchilla paper has become a standard tool for predicting training loss before spending compute. The formula allows extrapolating from small training runs to predict the loss of a much larger model or longer training run, provided the compute stays in a regime the formula was fitted on.

The key parameters: E is the irreducible entropy of the training data (the minimum loss any model can achieve, determined by the data distribution). A and alpha govern how loss decreases with more parameters. B and beta govern how loss decreases with more tokens. The original Chinchilla paper fitted these constants on MassiveText, a specific data mixture. Models trained on different data distributions, different architectures, or different tokenizers require re-fitting these constants.
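To make the roles of these parameters concrete, the sketch below evaluates each term of the formula for two configurations. The constants are the fitted values reported in the Chinchilla paper, which this article does not state, and Gopher's 300 billion training tokens comes from the Gopher paper rather than from this article. As the paragraph above notes, other data mixtures need their own fit, so the numbers are illustrative only.

```python
# Constants as reported in the Chinchilla paper (fitted on MassiveText);
# treat them as placeholders to be re-fitted for other data or architectures.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss_terms(n_params, n_tokens):
    """Split the predicted loss into its three additive components."""
    return {
        "irreducible": E,                          # floor set by the data
        "model_size_term": A / n_params**ALPHA,    # shrinks with more parameters
        "data_term": B / n_tokens**BETA,           # shrinks with more tokens
    }


def predicted_loss(n_params, n_tokens):
    return sum(loss_terms(n_params, n_tokens).values())


configs = {
    "Chinchilla (70B, 1.4T tokens)": (70e9, 1.4e12),
    "Gopher (280B, 300B tokens)": (280e9, 300e9),
}
for name, (n_params, n_tokens) in configs.items():
    terms = loss_terms(n_params, n_tokens)
    print(name, {k: round(v, 3) for k, v in terms.items()},
          "total:", round(predicted_loss(n_params, n_tokens), 3))
```

Under these constants the data term dominates the reducible loss for the larger, shorter-trained configuration, which is the formula's way of expressing that such a model is undertrained.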

The failure mode is extrapolation beyond the fitted range. The scaling laws are empirical power laws, not theoretical derivations. At very large scales, the regime of 100B+ parameter models trained on trillions of tokens, there is no guarantee that the same fitted constants apply. Frontier labs run their own internal scaling law experiments before each major training run precisely because external estimates are unreliable at their scale.

A 2024 paper (“Reconciling Kaplan and Chinchilla Scaling Laws,” Microsoft and MIT) analyzed why Kaplan and Chinchilla disagree and attributed much of the discrepancy to Kaplan counting non-embedding rather than total parameters, combined with the small scale of the models in the Kaplan experiments. Together these biased the estimates toward larger models. The Chinchilla authors had themselves pointed to Kaplan’s use of a fixed training token count and learning rate schedule across all models. The Chinchilla estimates are more reliable for practical compute allocation, but both sets of constants should be treated as empirical approximations rather than physical constants.

The Three Quantities That Define a Training Run

For practitioners, the Chinchilla framework reduces to three interrelated quantities: compute budget C (measured in FLOPs), model size N (parameters), and dataset size D (tokens). Any two of them approximately fix the third through the training FLOP count, and the scaling laws determine which split between N and D minimizes loss for a given C.

The approximate training FLOP count for a transformer is C = 6ND, counting roughly 2ND for the forward pass and 4ND for the backward pass. At a fixed compute budget C, the Chinchilla-optimal split is approximately N proportional to C^0.5 and D proportional to C^0.5, scaling both parameters and tokens as the square root of compute. This is the equal-scaling prescription: a 10x larger compute budget should produce a model roughly 3.2x larger (the square root of 10) trained on roughly 3.2x more data.
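Combining C = 6ND with a fixed tokens-per-parameter ratio gives a direct way to split a budget. The sketch below assumes the 20:1 rule of thumb. The function name is hypothetical, and the ratio is a parameter so that it can be swapped for a re-fitted or inference-adjusted value.

```python
import math


def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = r*N, so N = sqrt(C / (6r)) and D = r*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget of a 70B-parameter model trained on 1.4T tokens (figures from the text).
budget = 6 * 70e9 * 1.4e12
n, d = chinchilla_split(budget)
n10, d10 = chinchilla_split(10 * budget)

print(f"C = {budget:.3g} FLOPs -> N = {n:.3g}, D = {d:.3g}")
print(f"10x compute -> N grows {n10 / n:.2f}x, D grows {d10 / d:.2f}x")
```

Raising `tokens_per_param` above 20 at the same budget yields a smaller model trained on more tokens.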

The inference-adjusted variant shifts this split toward smaller N and larger D as inference volume grows. At the extreme (effectively infinite inference requests), the optimal N approaches zero and D approaches infinity, which practically means using the smallest model that can be trained to the required capability level on however much data is available.

Scaling Laws Beyond Language

The Chinchilla framework was developed for language modeling but has been applied to other modalities, including vision-language models where the interaction between visual encoder scale and language backbone adds a third variable to the training allocation problem.

The Arc Institute’s Evo 2 genomic foundation model, trained on 9.3 trillion DNA bases with a 40B parameter architecture, implicitly applies inference-optimal reasoning to genomics: a parameter count that fits on accessible hardware, trained on the largest available biological sequence corpus, optimizing for inference utility rather than training compute efficiency.

Protein language models face a different scaling challenge because biological sequence data has hard limits. ESM3’s 98B parameter architecture was trained on EvolutionaryScale’s compiled dataset of roughly 2.78 billion protein sequences (about 771 billion unique tokens), a corpus bounded by available biological sequence data rather than web crawl size. The scaling law analysis that applies to language models must account for data availability ceilings that are far less binding in natural language.

What Current Frontier Models Tell Us About Scaling Law Compliance

As of mid-2026, the major frontier models all train beyond Chinchilla-optimal in terms of data for their parameter count. The training compute numbers that labs disclose suggest tokens-to-parameter ratios well above 20:1 for most production-ready models. This is not an accident or an error.

The labs have internalized the inference-optimal argument. They are building models designed to be run at scale, not to minimize training FLOPs. The original Chinchilla prescription was correct for its time: it correctly diagnosed that GPT-3 and Gopher were dramatically undertrained. But the field has moved to a different optimization objective that Chinchilla’s original framing did not address.

The practical implication for anyone training a model today: Chinchilla’s 20:1 ratio is a floor, not a target. If you plan to run the model more than a few hundred thousand times, you should train longer than Chinchilla-optimal and use a smaller model than the Chinchilla-training-optimal recommendation. The inference-adjusted formula from Sardana and Frankle (arXiv:2401.00448) gives the quantitative framework for computing the optimal trade-off given your expected inference volume.

The Limits of Scaling Laws

Scaling laws predict continuous improvement with scale, but do not tell you when qualitative phase transitions occur. Emergent capabilities, the sudden jumps in performance on specific tasks as model scale crosses certain thresholds, are not predicted by the power law formulas. They appear as discontinuities relative to the smooth scaling trend.

The scaling laws also do not account for data quality. The formula assumes a fixed data distribution. Adding lower-quality data does not produce the same loss reduction as adding high-quality data, which means that effective data scale is not simply proportional to raw token count. As labs push toward 100-trillion-token training sets, data curation and filtering become as important as raw volume, a constraint the original scaling law formulas have no term for.

The model architecture is also held constant in the Kaplan and Chinchilla analyses, both of which studied dense transformers. Mixture-of-experts architectures, state space models (Mamba, etc.), and other alternatives to dense transformers have different scaling behaviors that require separate empirical fitting. The Chinchilla constants are specific to dense transformer architectures, and applying them to novel architectures produces unreliable predictions.

For the field, scaling laws remain the best available tool for planning large training runs, as empirical extrapolations from fitted models rather than physical laws. Every major lab treats them as a starting estimate to be validated by small-scale experiments before committing to a full training run, and the specific constants should always be re-fitted to the target architecture and data distribution rather than borrowed from published papers.
