RLHF / Post-training
Turning a raw autocomplete engine into a useful, safe assistant
What it is
Reinforcement Learning from Human Feedback (RLHF) is a post-training stage that transforms a pre-trained base model into a useful chatbot. A base model knows how to predict text, but it can't reliably follow instructions or maintain a helpful persona.
The process has four steps:

1. The base model generates multiple responses to the same prompt.
2. Human raters rank the responses by quality.
3. A reward model is trained on these preference rankings.
4. The LLM is fine-tuned with RL to maximize the reward model's score.

This stage needs relatively few examples compared to pre-training, yet after it the model consistently follows instructions and formats responses helpfully.
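The reward-model step can be illustrated with a toy sketch. A common formulation is the Bradley-Terry preference loss: the probability that raters prefer one response over another is modeled as a sigmoid of the reward difference. Everything below is an illustrative assumption, not a real RLHF pipeline: responses are stood in for by small feature vectors, and the reward model is a linear function trained with plain gradient descent.

```python
import numpy as np

# Toy sketch of reward-model training on pairwise preferences
# (Bradley-Terry loss). Feature vectors stand in for full responses;
# names and dimensions are illustrative, not from any real system.

rng = np.random.default_rng(0)

# Hidden "true" preference direction the raters implicitly follow.
true_w = np.array([1.0, -2.0, 0.5, 3.0])

# Each row is a feature vector standing in for one model response.
chosen = rng.normal(size=(256, 4)) + 0.5 * true_w   # preferred responses
rejected = rng.normal(size=(256, 4))                # dispreferred responses

w = np.zeros(4)  # reward-model parameters: reward(x) = w @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
# Minimize the negative log-likelihood by gradient descent.
for _ in range(500):
    margin = (chosen - rejected) @ w
    grad = -((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= 0.1 * grad

# The trained reward model should rank "chosen" above "rejected".
acc = ((chosen - rejected) @ w > 0).mean()
print(f"preference accuracy: {acc:.2f}")
```

In the subsequent RL step, the policy (the LLM) is then updated to maximize this learned reward, typically with a penalty that keeps it close to the original model so it doesn't drift into degenerate reward-hacking outputs.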
The same mechanism is used for safety alignment: it teaches the model to refuse harmful requests, even though its pre-training data may describe exactly how to carry them out.