Training Process

Pre-training

The massive first stage that teaches a model to predict language

What it is

Pre-training is the foundational and most expensive stage of LLM development. The model is trained on internet-scale data (essentially as much text as can be collected and cleaned) with a simple objective: predict the next token given all previous tokens.
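The next-token objective can be illustrated with a toy sketch. This is a character-level frequency model, nothing like a real transformer, but it shows the core idea: learn from observed (previous, next) pairs, then predict the most likely continuation.

```python
from collections import Counter, defaultdict

# Toy corpus; real pre-training uses trillions of tokens, not characters.
corpus = "the cat sat on the mat. the dog sat."

# Count how often each character follows each other character.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the character most often seen after `token` in the corpus."""
    return counts[token].most_common(1)[0][0]

print(predict_next("t"))  # 'h', because "th" is the most common pair here
```

A real LLM replaces the count table with a neural network that conditions on the entire preceding context and outputs a probability distribution over the whole vocabulary, trained by minimizing cross-entropy loss on the true next token.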

This stage uses entire data centers running for months and costs tens to hundreds of millions of dollars for frontier models. The result is a base model: a powerful autocomplete engine that has internalized patterns, facts, reasoning structures, and language from its training data, but has no instruction-following capability.

Data quality and quantity both matter. High-quality sources (textbooks, Wikipedia, peer-reviewed papers, curated code) are weighted heavily. But scale is critical: models need exposure to the full diversity of human language and knowledge.
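Weighting sources "heavily" is often implemented as mixture sampling: each training example is drawn from a source with some probability, so higher-weighted sources contribute disproportionately to the training stream. A minimal sketch, with entirely illustrative source names and weights:

```python
import random

random.seed(0)  # for a reproducible illustration

# Hypothetical data mixture; real labs tune these ratios empirically.
sources = ["web_crawl", "wikipedia", "books", "code"]
weights = [0.50, 0.20, 0.15, 0.15]  # made-up values, sum to 1.0

# Decide which source each of the next 10 training batches comes from.
batch_sources = random.choices(sources, weights=weights, k=10)
print(batch_sources)
```

In practice this sampling happens at enormous scale, and the weights interact with deduplication and quality filtering applied earlier in the data pipeline.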

Why it matters

Pre-training is why AI models know so much: it's where all the world knowledge comes from. It's also why they have a knowledge cutoff date, why they can write code in almost any language, and why they sometimes reproduce training data verbatim. The enormous cost of pre-training is why only a handful of organizations can build frontier models, which shapes the entire competitive landscape you'll be analyzing in industry conversations.

Related concepts

Resources

Deep Dive into LLMs like ChatGPT (Pre-training section)
youtube.com· The single best explanation of how LLMs are pre-trained. Covers data collection (Common Crawl, FineWeb), data cleaning, tokenization, next-token prediction, neural network architecture, training GPT-2 as a worked example, and inference. Designed for general audiences. Recommend at minimum the first hour.
75 min
Let's Build GPT: from scratch, in code, spelled out
youtube.com· Builds a GPT from scratch in code. The first 30 minutes provide excellent conceptual grounding in how pre-training works: data, tokenization, next-token prediction, loss functions, and the training objective. Technical but well-explained, accessible to intermediate learners.
117 min
How do Transformers work?
huggingface.co· Covers pre-training vs fine-tuning, self-supervised learning, and the different pre-training objectives (autoregressive vs masked language modeling). Concise and authoritative.
10 min
How LLMs Work: Pre-Training to Post-Training
towardsdatascience.com· Distills key concepts from Karpathy's 3.5-hour video into a concise written format. Covers data collection, next-token prediction, and inference clearly.
10 min
Understanding LLM Pre-Training and Custom LLMs
databricks.com· Industry-oriented overview of pre-training. Covers data preparation, tokenization, transformer training, and evaluation. Good for understanding why pre-training matters in practice.
10 min