Synthetic Data
Using AI to generate training data for AI, and why it's becoming essential
What it is
Synthetic data is training data generated by an AI model rather than collected from humans. As labs approach the limits of the high-quality human-written text available on the internet, they are turning to LLMs to generate additional training examples.
Applications include generating diverse paraphrases of existing content, creating question-answer pairs from documents, generating code with known-correct solutions for reinforcement learning with verifiable rewards (RLVR), and creating domain-specific training examples that don't exist in public data.
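To make the "known-correct solutions" idea concrete, here is a minimal sketch of filtering candidate code samples by running them against test cases. The candidate strings stand in for model outputs, and the helper name `passes_tests` and the toy `add` task are hypothetical, chosen only for illustration. A real pipeline would run untrusted code in a sandbox rather than with a bare `exec`.

```python
def passes_tests(source, func_name, test_cases):
    """Return True if the code in `source` defines `func_name` and it
    produces the expected output for every (args, expected) pair."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code. This is for illustration only;
        # generated code should be executed in an isolated sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical candidates, standing in for solutions sampled from a model.
candidates = [
    "def add(a, b):\n    return a + b",   # correct
    "def add(a, b):\n    return a - b",   # wrong logic
    "def add(a, b)\n    return a + b",    # syntax error
]

# The verifiable part: inputs paired with known-correct outputs.
test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Keep only the candidates that pass every test.
verified = [c for c in candidates if passes_tests(c, "add", test_cases)]
print(f"kept {len(verified)} of {len(candidates)} candidates")
```

The same pass/fail signal that filters samples here could also serve as a reward during training, which is what makes this kind of data verifiable.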
The key concern is model collapse: if you train a model on AI-generated data and then use that model to generate more training data, quality can degrade recursively with each generation unless the process is carefully managed.
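A toy simulation can illustrate the recursive loop. In this sketch the "model" is just a Gaussian fitted to its training data, which is a large simplification of an LLM. Each generation is fitted only to samples drawn from the previous generation's fit, and the spread of the distribution tends to shrink over time. The `real_fraction` parameter is an assumption added for illustration: it mixes fresh samples from the original distribution into each generation, as one possible way of managing the loop. All names and parameter values here are hypothetical.

```python
import random
import statistics


def simulate_recursive_training(generations=500, sample_size=10,
                                real_fraction=0.0, seed=0):
    """Repeatedly fit a Gaussian to data sampled from the previous fit.

    Returns the fitted standard deviation after each generation,
    starting with the true value of 1.0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "real" data distribution
    history = [sigma]
    n_real = int(sample_size * real_fraction)

    for _ in range(generations):
        # Synthetic samples come from the current (fitted) model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        # Optional fresh samples from the original distribution (assumed mitigation).
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real

        # "Train" the next model: estimate mean and standard deviation.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        history.append(sigma)

    return history


collapsed = simulate_recursive_training()                   # synthetic data only
anchored = simulate_recursive_training(real_fraction=0.5)   # half real data each round

print(f"synthetic only: final std = {collapsed[-1]:.3g}")
print(f"with real data: final std = {anchored[-1]:.3g}")
```

In the synthetic-only run, estimation noise compounds across generations and the fitted distribution loses its diversity. Keeping real data in the mix anchors each generation to the original distribution in this toy setting; whether and how well that carries over to large models is beyond what this sketch can show.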