RLHF / Post-training
Turning a raw autocomplete engine into a useful, safe assistant
What it is
Reinforcement Learning from Human Feedback (RLHF) is a post-training stage that transforms a pre-trained base model into a useful chatbot. A base model knows how to predict text, but it can't reliably follow instructions or maintain a helpful persona.
The process has four steps:

1. The base model generates multiple responses to the same prompt.
2. Human raters rank the responses by quality.
3. A reward model is trained on these preference rankings.
4. The LLM is fine-tuned with RL to maximize the reward model's score.

This stage needs relatively few examples compared to pre-training, yet after it the model consistently follows instructions and formats responses helpfully.
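The reward-model step can be illustrated with a toy sketch. A common formulation is the Bradley-Terry preference loss: the probability that raters prefer one response over another is modeled as a sigmoid of the reward difference. Everything below is an illustrative assumption, not a real RLHF pipeline: responses are stood in for by small feature vectors, and the reward model is a linear function trained with plain gradient descent.

```python
import numpy as np

# Toy sketch of reward-model training on pairwise preferences
# (Bradley-Terry loss). Feature vectors stand in for full responses;
# names and dimensions are illustrative, not from any real system.

rng = np.random.default_rng(0)

# Hidden "true" preference direction the raters implicitly follow.
true_w = np.array([1.0, -2.0, 0.5, 3.0])

# Each row is a feature vector standing in for one model response.
chosen = rng.normal(size=(256, 4)) + 0.5 * true_w   # preferred responses
rejected = rng.normal(size=(256, 4))                # dispreferred responses

w = np.zeros(4)  # reward-model parameters: reward(x) = w @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
# Minimize the negative log-likelihood by gradient descent.
for _ in range(500):
    margin = (chosen - rejected) @ w
    grad = -((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= 0.1 * grad

# The trained reward model should rank "chosen" above "rejected".
acc = ((chosen - rejected) @ w > 0).mean()
print(f"preference accuracy: {acc:.2f}")
```

In the subsequent RL step, the policy (the LLM) is then updated to maximize this learned reward, typically with a penalty that keeps it close to the original model so it doesn't drift into degenerate reward-hacking outputs.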
The same mechanism is used for safety alignment: it teaches the model to refuse harmful requests, even though its pre-training data may describe exactly how to carry them out.