Scaling & Data

Scaling Laws

The mathematical relationships that predict how AI models improve with scale

What it is

Scaling laws are empirical mathematical relationships that describe how model performance improves predictably with increases in model size, training data, and compute. The key finding from DeepMind's Chinchilla paper: given a fixed compute budget, there's an optimal allocation between model size and data quantity.
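The Chinchilla paper fit a parametric form for loss as a function of model size N and token count D, which is what makes the trade-off quantitative. A minimal sketch, using the published fitted constants as illustrative values (treat them as assumptions, not exact):

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted training loss under the Chinchilla parametric fit:
    L(N, D) = E + A / N^alpha + B / D^beta.
    E is irreducible loss; the other terms shrink as model size (N)
    and training tokens (D) grow. Constants here are the fitted values
    reported in the paper, used illustratively."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Holding model size fixed, more training data lowers predicted loss:
small_data = chinchilla_loss(70e9, 300e9)
large_data = chinchilla_loss(70e9, 1.4e12)
```

The two power-law terms are why performance improves smoothly and predictably: each term decays at a fixed rate as its resource scales up.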

The Chinchilla optimal ratio suggests training a model on approximately 20 tokens per parameter (e.g., a 7B model should train on ~140B tokens for compute-optimal training). Earlier models like GPT-3 were significantly undertrained by this metric.
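The ~20 tokens-per-parameter rule can be turned into a compute-budget allocator using the common approximation that training a dense transformer costs roughly 6 FLOPs per parameter per token. A hedged sketch (both rules of thumb are approximations, not exact laws):

```python
import math

def chinchilla_optimal(compute_flops):
    """Split a FLOP budget between parameters (N) and tokens (D)
    using the rules of thumb C ~ 6*N*D and D ~ 20*N.
    Substituting D = 20N into C = 6ND gives C = 120*N^2."""
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer."""
    return 6 * n_params * n_tokens

# A 7B-parameter model trained on ~140B tokens (20 tokens/param)
# costs roughly 6e9 * 7 * 140e9 ~ 5.9e21 FLOPs:
budget = training_flops(7e9, 140e9)
```

Running the allocator in reverse on that budget recovers roughly the same 7B / 140B split, which is the sense in which the ratio is "compute-optimal": for a fixed budget, this split minimizes predicted loss.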

Scaling laws are why AI labs can make confident predictions about what a training run will achieve before spending the compute.

Why it matters

Scaling laws are the intellectual foundation for the massive investment in AI compute. "We know spending more will make the model better, and we can predict by how much" is an unusually strong basis for capital allocation. They also explain why model development strategy changed significantly after the Chinchilla paper: labs shifted from training ever-bigger models to training smaller models on more data.

