Training Process

RLHF / Post-training

Turning a raw autocomplete engine into a useful, safe assistant

What it is

Reinforcement Learning from Human Feedback (RLHF) is the post-training stage that transforms a pre-trained base model into a useful chatbot. A base model knows how to predict text but can't reliably follow instructions or maintain a helpful persona.

The process: the base model generates multiple responses to the same prompt, human raters rank the responses by quality, a reward model is trained on these preferences, and then the LLM is fine-tuned using RL to maximize the reward model's score. After relatively few examples compared to pre-training, the model learns to consistently follow instructions and format responses helpfully.
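The reward-model step above can be sketched in miniature. This is a hypothetical toy, not any production implementation: it assumes a one-parameter linear "reward model" and shows the standard Bradley-Terry preference loss, where the model is trained so that the response humans preferred scores higher than the one they rejected.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Small when the chosen response already outscores the rejected one."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

def reward(weight, feature):
    """Toy reward model: a single learned weight times one response feature
    (e.g. a crude 'helpfulness' score). Real reward models are full LLMs."""
    return weight * feature

def train_step(weight, chosen_feat, rejected_feat, lr=0.1):
    """One gradient step on the preference loss.
    dL/dw = -(1 - sigmoid(diff)) * (chosen_feat - rejected_feat)."""
    diff = reward(weight, chosen_feat) - reward(weight, rejected_feat)
    sig = 1.0 / (1.0 + math.exp(-diff))
    grad = -(1.0 - sig) * (chosen_feat - rejected_feat)
    return weight - lr * grad

# Train on one hypothetical comparison: raters preferred the response
# with feature 1.0 over the one with feature 0.2.
w = 0.0
for _ in range(100):
    w = train_step(w, chosen_feat=1.0, rejected_feat=0.2)

# The reward model now ranks the preferred response higher,
# and the RL stage would fine-tune the LLM to maximize this score.
assert reward(w, 1.0) > reward(w, 0.2)
```

In the real pipeline the reward model is itself a large network trained on many thousands of ranked pairs, and the RL stage (often PPO) optimizes the LLM against it rather than updating a single weight.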

This is the same mechanism used for safety alignment: teaching the model to refuse harmful requests even though its pre-training data includes material describing how to carry them out.

Why it matters

RLHF is the difference between GPT-3 (2020, research curiosity) and ChatGPT (2022, product used by hundreds of millions). It's what makes models practically useful. It also explains why different models have different "personalities" and why safety behaviors can be inconsistent: the reward model and fine-tuning process involve significant design choices and tradeoffs.

Resources

Deep Dive into LLMs like ChatGPT (RLHF section)
youtube.com· Covers supervised finetuning and RLHF within the full training pipeline. Uses the great AlphaGo "Move 37" analogy for how RL helps models discover novel strategies. Covers reward models, human preference ranking, and why RLHF matters for making models helpful. General audience, no prerequisites.
20 min
Reinforcement Learning with Human Feedback (RLHF), Clearly Explained
youtube.com· StatQuest's signature step-by-step approach applied to RLHF. Covers the reward model, PPO, and how human preferences get distilled into model behavior. Tailored for people who already understand the basics of LLMs.
25 min
Illustrating Reinforcement Learning from Human Feedback (RLHF)
huggingface.co· The definitive illustrated guide to RLHF. Breaks the three-phase pipeline into distinct steps (pretraining → reward model training → RL fine-tuning) with clear diagrams. Widely cited in the field.
15 min
RLHF: Reinforcement Learning from Human Feedback
huyenchip.com· Excellent deep dive from an ML engineer's perspective. Uses the "Shoggoth with a smiley face" meme/analogy to explain each training stage. Covers all three phases with mathematical formulations included for those who want them, but readable without them.
20 min
What Is RLHF?
ibm.com· Concise, high-quality overview from a trusted source. Good for recruits who want a quick primer before diving into the longer resources.
8 min