Reasoning Training / RLVR
How modern models learn to think step-by-step using verifiable rewards
What it is
Reinforcement Learning from Verifiable Rewards (RLVR) is the training technique behind modern reasoning models like OpenAI's o1/o3 and Claude's extended thinking mode. Unlike RLHF, which relies on human preference judgments, RLVR uses datasets where answers can be automatically verified as correct or incorrect: math problems, coding challenges, logic puzzles.
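The "verifiable" part is what makes this work: a reward can be computed mechanically, with no human in the loop. Here is a minimal sketch of what such a verifier might look like for math problems. The grading convention (comparing the last number in the completion against a reference answer) is a simplifying assumption for illustration, not how any particular lab implements it.

```python
import re

def verify_math_answer(completion: str, expected: str) -> float:
    """Hypothetical verifier: reward 1.0 if the last number in the
    model's completion matches the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == expected else 0.0

# A correct chain of reasoning ending in the right answer earns reward 1.0.
print(verify_math_answer("12 * 4 = 48, so the answer is 48.", "48"))  # 1.0
# A wrong final answer earns 0.0, regardless of how plausible the steps look.
print(verify_math_answer("12 * 4 = 46, so the answer is 46.", "48"))  # 0.0
```

Binary, automatically computed rewards like this are cheap to scale across millions of problems, which is exactly what RLHF's human-labeled preferences are not.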
The model generates solutions, gets rewarded for correct answers, and iteratively learns strategies that work. Crucially, no one programmed chain-of-thought reasoning or backtracking; these emerged from training pressure alone. The model discovered that thinking through problems step-by-step leads to more correct answers.
The result is "test-time compute": the model can spend more reasoning tokens on harder problems, scaling its capability with task difficulty.