What is RLVR? The AI Training Trick Behind Every Reasoning Model in 2026

May 14, 2026 · 8 min read

There is a technique quietly powering almost every impressive AI reasoning model you have heard about in the last year. It is behind OpenAI's o3, DeepSeek-R1, and the reasoning capabilities in the latest versions of Claude. It is called RLVR — Reinforcement Learning with Verifiable Rewards — and it represents one of the most important shifts in how AI models are trained.

You have probably never heard the acronym. That is about to change.

The Problem RLVR Solves

To understand RLVR, you need to understand what came before it.

For years, the dominant method for making AI models helpful was RLHF — Reinforcement Learning from Human Feedback. The idea was straightforward: show the model two responses, have a human pick the better one, and train the model to produce more of what humans prefer. Repeat this millions of times and you get a model that feels helpful, conversational, and safe.

RLHF is why ChatGPT felt like a leap forward when it launched. It is what made AI assistants feel less like a search engine and more like a conversation.

But RLHF has a fundamental problem: humans are expensive, slow, and inconsistent.

To scale RLHF, you need thousands of human raters making millions of judgments. Those judgments are subjective — two people can disagree about which response is better, especially for complex technical questions. And the process does not scale cleanly. As you want smarter models, you need smarter raters, and smarter raters are harder to find and more expensive to hire.

There is also a deeper issue: if you train a model to do what humans prefer, it learns to produce responses that sound right — not necessarily responses that are right. A confident, well-structured wrong answer can score better than a hesitant correct one. Over time, the model gets better at seeming correct rather than being correct.

RLVR was built to fix all of this.

What RLVR Actually Is

RLVR stands for Reinforcement Learning with Verifiable Rewards.

The core idea is simple: instead of asking a human "which response is better?", you check the response against an objective verifier that tells you whether the answer is actually correct.

In math: did the model get the right answer? A computer can check this instantly and perfectly.

In code: does the program pass the unit tests? Run the tests. You get a binary yes or no.

In logic: does the proof hold? A formal verification system can check.

No human needed. No subjectivity. No scale limit. The verifier runs automatically, gives a clear correct or incorrect signal, and the model trains on that signal instead of human opinion.
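
To make that concrete, here is a minimal Python sketch of two such verifiers. Everything specific in it is an assumption made for illustration: the function names verify_math and verify_code, the convention that the model marks its result with "Answer:", and the rule that submitted code defines a function called solve. A real system would also run untrusted code in a sandbox rather than calling exec directly.

```python
import re
from fractions import Fraction


def verify_math(response: str, expected: str) -> bool:
    """Return True if the last 'Answer: <number>' in the response equals expected.

    Assumes (for illustration) that the model ends with a line such as
    'Answer: 3/4'. Fractions make 0.75 and 3/4 compare as equal.
    """
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?(?:/\d+)?)", response)
    if not matches:
        return False  # no parsable final answer counts as incorrect
    try:
        return Fraction(matches[-1]) == Fraction(expected)
    except (ValueError, ZeroDivisionError):
        return False


def verify_code(source: str, tests: list) -> bool:
    """Return True if the function `solve` defined in source passes every test.

    `tests` is a list of (args_tuple, expected_result) pairs.
    WARNING: exec on untrusted model output is unsafe; this is only a sketch.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False  # crashes, a missing function, or wrong output all fail


print(verify_math("Half of 1.5 is 0.75. Answer: 0.75", "3/4"))  # True
print(verify_code("def solve(a, b):\n    return a + b", [((1, 2), 3)]))  # True
```

Both functions return a plain True or False, which is exactly the kind of signal the training loop consumes.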

Andrej Karpathy, one of the most respected voices in AI, described what happens when you train this way: by optimising against automatically verifiable rewards, models spontaneously develop strategies that look like reasoning to humans. They learn to break problems into steps. They learn to check their own work. They learn to go back and try a different approach when something does not add up. Nobody programmed that behaviour in. It emerged from the training process.

That emergent reasoning is what people mean when they talk about "thinking" models.

How It Actually Works

The mechanics of RLVR involve a few moving pieces.

The model generates a response to a question — say, a maths problem. That response goes to a verifier, which checks whether the final answer is correct. The verifier returns a reward: typically 1 if correct, 0 if not. The model updates its weights to make correct responses more likely in future.
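
The loop described above can be sketched in a few lines of Python. This is a toy under heavy assumptions, not how any production system is written: the "model" is just one adjustable score per candidate answer, the question and candidates are made up, and the update rule is plain REINFORCE, one of the simplest policy-gradient methods.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(candidates, correct, steps=500, lr=0.5, seed=0):
    """Toy generate -> verify -> reward -> update loop.

    The 'model' is one logit per candidate answer. Real systems update the
    weights of a language model, but the shape of the loop is the same.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start with a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Generate: sample one answer from the current policy.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        # 2. Verify: binary reward, 1 if correct and 0 otherwise.
        reward = 1.0 if candidates[i] == correct else 0.0
        # 3. Update: gradient of the log-probability of the sampled answer,
        #    scaled by the reward. A reward of 0 leaves the policy unchanged.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


# Hypothetical question: "What is 2 + 2?" with three candidate answers.
final_probs = train(candidates=[3, 4, 5], correct=4)
print(final_probs)
```

The policy starts out picking each answer a third of the time. Because only verified-correct samples trigger an update, probability mass shifts steadily toward the answer 4.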

Many modern implementations use a method called GRPO (Group Relative Policy Optimisation), which DeepSeek introduced in its DeepSeekMath paper in 2024 and then used to train R1. Rather than scoring one response at a time, GRPO samples a group of responses to the same question and scores each one relative to the rest of the group. Responses that beat the group average are reinforced, and those that fall below it are discouraged.
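
The group comparison at the heart of GRPO comes down to a small calculation: subtract the group's mean reward from each response's reward and divide by the group's standard deviation. The sketch below shows only that step, with a hypothetical function name and made-up rewards. The full GRPO objective has further parts (such as clipping and a penalty for drifting from a reference model) that are left out here, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group.

    Responses that beat the group average get a positive advantage;
    responses below average get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question, scored 1 (correct) or 0 (incorrect).
print(group_relative_advantages([1, 0, 0, 1]))

# If every response gets the same reward, all advantages are zero and the
# group contributes no learning signal.
print(group_relative_advantages([0, 0, 0, 0]))
```

One consequence is visible in the second call: a question the model always gets wrong, or always gets right, teaches it nothing, because no response stands out from the group.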

The reward is "outcome-only" in most implementations: the model only gets credit if the final answer is verifiably correct. It does not get partial credit for a good attempt or a well-structured wrong answer. This is deliberately harsh — and it works. The model learns that sounding confident is worthless if the answer is wrong.

One important nuance: RLVR works best on tasks where success is machine-checkable. Maths problems have right answers. Code either passes tests or it does not. Logic proofs are valid or invalid. These are the domains where RLVR has produced the most dramatic improvements.

Where it struggles is open-ended tasks. If you ask a model to write a poem or give relationship advice, there is no verifier that can tell you objectively whether the output is correct. RLVR does not help there — and RLHF, for all its limitations, still has a role in those domains.

Why This Matters — The Models It Built

RLVR is not just a research concept. It is in production, powering models you may already use.

DeepSeek-R1 is the clearest example. When DeepSeek published their R1 paper in early 2025, it showed a model trained heavily with RLVR achieving performance competitive with OpenAI's best models — at a fraction of the cost. The paper described how the model, through RLVR training, developed something that looked like an internal monologue: working through problems step by step, catching errors, reconsidering approaches. That behaviour was not designed. It emerged.

OpenAI's o-series models (o1, o3, and o4-mini) are trained with large-scale reinforcement learning, according to OpenAI, although the company has not published the full recipe. When OpenAI describes these models as "reasoning models", rewards that can be checked automatically are widely believed to be a significant part of what it means. o3 achieved results on AIME (a hard maths competition), SWE-bench (real software engineering tasks), and Codeforces (competitive programming) that would have seemed impossible two years earlier.

Claude's reasoning capabilities also draw on RLVR principles, as do most frontier models released in 2025 and 2026. The shift from pure RLHF to RLVR-inclusive training is now the industry standard for models that need to be reliably correct rather than just convincingly fluent.

The Honest Limitations

RLVR is not a magic fix. The research community has been working through its limitations seriously.

The most studied limitation is domain scope. RLVR needs a verifier — and building good verifiers is harder than it sounds. Maths and code have natural verifiers. Most real-world tasks do not. Extending RLVR to domains like chemistry, biology, and medicine requires building automated verification systems for those fields first, which is a significant engineering challenge in itself.

There is also an ongoing debate about what RLVR actually does. A paper presented at NeurIPS 2025 found that RLVR may improve sampling efficiency — making models better at finding correct answers they could already produce — more than it expands the fundamental reasoning boundary. In other words: it might make models better at using what they already know rather than teaching them genuinely new reasoning skills. This is still contested, with other papers arguing the opposite.
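
This debate is usually framed in terms of pass@k: the chance that at least one of k sampled answers to a problem is correct. The Python sketch below uses the standard unbiased estimator, which takes n samples of which c were correct and computes 1 - C(n-c, k) / C(n, k). The counts in the example are invented purely to show the pattern and are not results from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given that c of n sampled answers were correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts for one problem, 100 samples per model (not real results).
base_correct, rl_correct = 5, 40
for k in (1, 10, 50):
    base = pass_at_k(100, base_correct, k)
    tuned = pass_at_k(100, rl_correct, k)
    print(f"k={k}: base={base:.3f} rl={tuned:.3f}")
```

With these invented numbers the tuned model is far ahead at k=1, but the gap narrows sharply once both models get 50 attempts. A pattern like that is what "better sampling efficiency without a wider reasoning boundary" would look like: the base model could already reach the answer, just far less often.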

A third challenge is long-context performance. Standard RLVR uses outcome-only rewards — the model gets credit for the final answer, not for how it got there. For long documents or complex multi-step tasks, this creates a "vanishing learning signal" problem: the model cannot reliably connect a correct final answer to the specific reasoning steps that led to it. Research presented at ICLR 2026 addressed this directly with LongRLVR, which adds verifiable context rewards alongside answer correctness to fix this gap.

Where This Is Going

The trajectory is clear: RLVR will expand into more domains as researchers build better verifiers.

Chemistry is next in line — reaction prediction can be verified against simulation engines. Biology follows, with protein folding structures checkable against folding databases. Physics simulations can be validated computationally. Each of these domains gets RLVR when someone builds the verification infrastructure.

Beyond domain expansion, RLVR represents a broader principle that is reshaping how the industry thinks about AI improvement: optimise only against outcomes you can verify, audit, and reproduce. The models trained this way are more reliable in the domains they cover, more honest about uncertainty, and less prone to producing confident nonsense.

That is a fundamentally different kind of AI than the one trained purely to satisfy human preferences.

The One-Sentence Version

If you need to explain RLVR to someone in thirty seconds: it is how you train an AI to actually be correct, not just to sound correct — by checking its answers against a verifier instead of asking a human which response seems better.

That shift, from "what do humans prefer" to "what is verifiably right", is behind the reasoning models that have redefined what AI can do over the last eighteen months.
