Training Process

Pre-training

The massive first stage that teaches a model to predict language

What it is

Pre-training is the foundational and most expensive stage of LLM development. The model is trained on internet-scale data (essentially as much text as can be collected and cleaned) with a simple objective: predict the next token given all previous tokens.
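The next-token objective can be illustrated with a toy sketch. This is a character-level frequency model, nothing like a real transformer, but it shows the core idea: learn from observed (previous, next) pairs, then predict the most likely continuation.

```python
from collections import Counter, defaultdict

# Toy corpus; real pre-training uses trillions of tokens, not characters.
corpus = "the cat sat on the mat. the dog sat."

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the character most often seen after `token` in the corpus."""
    return counts[token].most_common(1)[0][0]

print(predict_next("t"))  # 'h', because "th" is the most common pair here
```

A real LLM replaces the count table with a neural network that conditions on the entire preceding context and outputs a probability distribution over the whole vocabulary, trained by minimizing cross-entropy loss on the true next token.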

This stage uses entire data centers running for months and costs tens to hundreds of millions of dollars for frontier models. The result is a base model: a powerful autocomplete engine that has internalized patterns, facts, reasoning structures, and language from its training data, but has no instruction-following capability.

Data quality and quantity both matter. High-quality sources (textbooks, Wikipedia, peer-reviewed papers, curated code) are weighted heavily. But scale is critical: models need exposure to the full diversity of human language and knowledge.
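Weighting sources "heavily" is often implemented as mixture sampling: each training example is drawn from a source with some probability, so higher-weighted sources contribute disproportionately to the training stream. A minimal sketch, with entirely illustrative source names and weights:

```python
import random

random.seed(0)  # for a reproducible illustration

# Hypothetical data mixture; real labs tune these ratios empirically.
sources = ["web_crawl", "wikipedia", "books", "code"]
weights = [0.50, 0.20, 0.15, 0.15]  # made-up values, sum to 1.0

# Decide which source each of the next 10 training batches comes from.
batch_sources = random.choices(sources, weights=weights, k=10)
print(batch_sources)
```

In practice this sampling happens at enormous scale, and the weights interact with deduplication and quality filtering applied earlier in the data pipeline.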

Why it matters

Pre-training is why AI models know so much: it's where all the world knowledge comes from. It's also why they have a knowledge cutoff date, why they can write code in almost any language, and why they sometimes reproduce training data verbatim. The enormous cost of pre-training is why only a handful of organizations can build frontier models, which shapes the entire competitive landscape you'll be analyzing in industry conversations.

Related concepts

Resources

Deep Dive into LLMs like ChatGPT (Pre-training section)
youtube.com· The single best explanation of how LLMs are pre-trained. Covers data collection (Common Crawl, FineWeb), data cleaning, tokenization, next-token prediction, neural network architecture, training GPT-2 as a worked example, and inference. Designed for general audiences. Recommend at minimum the first hour.
75 min
Let's Build GPT: from scratch, in code, spelled out
youtube.com· Builds a GPT from scratch in code. The first 30 minutes provide excellent conceptual grounding in how pre-training works: data, tokenization, next-token prediction, loss functions, and the training objective. Technical but well-explained, accessible to intermediate learners.
117 min
How do Transformers work?
huggingface.co· Covers pre-training vs fine-tuning, self-supervised learning, and the different pre-training objectives (autoregressive vs masked language modeling). Concise and authoritative.
10 min
How LLMs Work: Pre-Training to Post-Training
towardsdatascience.com· Distills key concepts from Karpathy's 3.5-hour video into a concise written format. Covers data collection, next-token prediction, and inference clearly.
10 min
Understanding LLM Pre-Training and Custom LLMs
databricks.com· Industry-oriented overview of pre-training. Covers data preparation, tokenization, transformer training, and evaluation. Good for understanding why pre-training matters in practice.
10 min