Synthetic Data

Using AI to generate training data for AI, and why it's becoming essential

What it is

Synthetic data is training data generated by an AI model rather than collected from humans. As labs exhaust the supply of high-quality human-written text on the internet, they are turning to LLMs to generate additional training examples.

Applications include: generating diverse paraphrases of existing content, creating question-answer pairs from documents, generating code with known-correct solutions for RLVR (reinforcement learning with verifiable rewards) training, and creating domain-specific training examples that don't exist in public data.
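As a rough illustration of the question-answer application, here is a minimal sketch. The `llm` argument is a hypothetical callable (prompt in, text out) standing in for whatever model API you use, and the prompt wording, the JSON reply format, and the name `qa_pairs_from_document` are all assumptions made for this example.

```python
import json
from typing import Callable

# Hypothetical prompt; real pipelines tune this wording heavily.
PROMPT_TEMPLATE = (
    "Read the passage and write {n} question-answer pairs that can be answered "
    "from it alone. Reply with a JSON list of objects with keys "
    '"question" and "answer".\n\nPassage:\n{passage}'
)


def qa_pairs_from_document(passage: str, llm: Callable[[str], str], n: int = 3) -> list[dict]:
    """Ask a model for QA pairs about `passage` and keep only well-formed ones."""
    raw = llm(PROMPT_TEMPLATE.format(n=n, passage=passage))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # drop generations that are not valid JSON
    if not isinstance(items, list):
        return []

    pairs = []
    for item in items:
        # Keep only entries that have a non-empty question and answer.
        if isinstance(item, dict) and item.get("question") and item.get("answer"):
            pairs.append({
                "question": str(item["question"]),
                "answer": str(item["answer"]),
                "source": passage,  # keep provenance for later filtering
            })
    return pairs
```

Discarding malformed output instead of repairing it is a deliberate choice here: with synthetic data it is usually cheaper to generate another sample than to salvage a bad one.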

The key concern is model collapse: if you train a model on AI-generated data and then use that model to generate more training data, quality can degrade recursively unless the process is carefully managed.
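A toy simulation can make the recursive loop concrete. The sketch below is an analogy, not how LLM training works: the "model" is just a Gaussian fitted to a small sample, and each generation is trained only on samples drawn from the previous one. The function name, the sample sizes, and the `real_fraction` knob (mixing fresh real data back in as one possible form of "careful management") are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_generations(n_samples=10, generations=200, real_fraction=0.0, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy setup: the real data follows N(0, 1). Each generation fits a Gaussian
    to a mix of fresh real samples and samples from the previous model.
    """
    rng = random.Random(seed)
    real_mu, real_sigma = 0.0, 1.0
    model_mu, model_sigma = real_mu, real_sigma  # generation 0 matches the real data
    history = [model_sigma]
    n_real = round(n_samples * real_fraction)

    for _ in range(generations):
        synthetic = [rng.gauss(model_mu, model_sigma) for _ in range(n_samples - n_real)]
        real = [rng.gauss(real_mu, real_sigma) for _ in range(n_real)]
        data = synthetic + real
        # "Train" the next model: maximum-likelihood Gaussian fit.
        model_mu = statistics.fmean(data)
        model_sigma = statistics.pstdev(data)
        history.append(model_sigma)
    return history


if __name__ == "__main__":
    collapsed = simulate_generations(real_fraction=0.0)
    anchored = simulate_generations(real_fraction=0.5)
    print(f"synthetic only: final std = {collapsed[-1]:.2e}")
    print(f"half real data: final std = {anchored[-1]:.2e}")
```

With purely synthetic data the fitted spread shrinks toward zero over the generations, so the tails of the original distribution are lost first. That is the toy version of the degradation described above.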

Why it matters

Synthetic data is what makes RLVR training scalable: you can generate essentially unlimited math and coding problems with verifiable answers. It's also a key tool for teams that need domain-specific fine-tuning data that doesn't exist publicly. Understanding the tradeoffs (cost, quality, collapse risk) is important for any team doing model training or fine-tuning.
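To show what "verifiable answers" means in practice, here is a minimal sketch of a problem generator paired with a programmatic checker. The names `make_problem` and `reward`, the arithmetic task, and the 0/1 reward are hypothetical choices for this example; real RLVR setups use much harder problems and more forgiving answer parsing.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_problem(rng: random.Random) -> dict:
    """Generate one arithmetic problem along with its known-correct answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    symbol = rng.choice(sorted(OPS))
    return {"prompt": f"What is {a} {symbol} {b}?", "answer": OPS[symbol](a, b)}


def reward(problem: dict, model_output: str) -> float:
    """Return 1.0 if the model's output is exactly the correct integer, else 0.0."""
    try:
        return 1.0 if int(model_output.strip()) == problem["answer"] else 0.0
    except ValueError:
        return 0.0  # anything that does not parse as an integer earns no reward


if __name__ == "__main__":
    rng = random.Random(0)
    for problem in (make_problem(rng) for _ in range(3)):
        # Feeding back the known answer stands in for a correct model response.
        print(problem["prompt"], "reward:", reward(problem, str(problem["answer"])))
```

Because the answer is computed when the problem is created, no human labeling is needed, and the generator can be called as many times as training requires.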

Resources

Synthetic Data Generation in 2025: Scale ML Training Smartly
cleverx.com · Practical overview of GANs, diffusion models, and LLM-based synthetic data, with industry examples
12 min
LLM + Data: Building AI with Real & Synthetic Data
youtube.com · Covers how LLMs rely on data, the challenges of creating diverse datasets, the role of synthetic data, and how data work tackles bias
12 min
AI models collapse when trained on recursively generated data
nature.com · The landmark paper defining "model collapse"; demonstrates that indiscriminate use of AI-generated training data causes irreversible defects in the resulting models
15 min
Examining synthetic data: The promise, risks and realities
ibm.com · Balanced overview covering both sides: NVIDIA's Nemotron-4 340B, IBM Research's LAB method, quality concerns, and the data wall problem
10 min