Reasoning Training / RLVR
How modern models learn to think step-by-step using verifiable rewards
What it is
Reinforcement Learning from Verifiable Rewards (RLVR) is the training technique behind modern reasoning models like OpenAI's o1/o3 and Claude's extended thinking mode. Unlike RLHF, which relies on human preference judgments, RLVR uses datasets where answers can be automatically verified as correct or incorrect: math problems, coding challenges, logic puzzles.
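The "verifiable" part is what makes this work: a reward can be computed mechanically, with no human in the loop. Here is a minimal sketch of what such a verifier might look like for math problems. The grading convention (comparing the last number in the completion against a reference answer) is a simplifying assumption for illustration, not how any particular lab implements it.

```python
import re

def verify_math_answer(completion: str, expected: str) -> float:
    """Hypothetical verifier: reward 1.0 if the last number in the
    model's completion matches the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == expected else 0.0

# A correct chain of reasoning ending in the right answer earns reward 1.0.
print(verify_math_answer("12 * 4 = 48, so the answer is 48.", "48"))  # 1.0
# A wrong final answer earns 0.0, regardless of how plausible the steps look.
print(verify_math_answer("12 * 4 = 46, so the answer is 46.", "48"))  # 0.0
```

Binary, automatically computed rewards like this are cheap to scale across millions of problems, which is exactly what RLHF's human-labeled preferences are not.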
The model generates solutions, gets rewarded for correct answers, and iteratively learns strategies that work. Crucially, no one programmed chain-of-thought reasoning or backtracking; these emerged from training pressure alone. The model discovered that thinking through problems step-by-step leads to more correct answers.
The result is "test-time compute": the model can spend more reasoning tokens on harder problems, scaling its capability with task difficulty.