CL, AI, LG · May 21, 2025

NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Tsinghua
arXiv:2505.16022v2 · 8 citations · h-index: 11 · Has Code
Originality: Highly original
AI Analysis

NOVER enables incentive training across a wide range of text-to-text tasks, removing a bottleneck that has limited the broader applicability of this paradigm in language modeling.

The paper addresses a key limitation of incentive training methods, their reliance on external verifiers, by proposing NOVER, a verifier-free reinforcement learning framework that uses only standard supervised fine-tuning data. NOVER outperforms a same-size model distilled from large reasoning models by 7.7 percent.

Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.
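To make the verifier dependence concrete, here is a minimal sketch of the kind of outcome-only reward the abstract describes, where only the final answer part of the output is scored and the reasoning text is ignored. This is not NOVER's reward, which the text above does not specify. The `<answer>` tag format, the function names, and the exact-match check are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical output format: the final answer is wrapped in <answer> tags.
# The source does not say how the answer part is delimited.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_final_answer(output: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> span, or None if absent."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None


def exact_match_verifier(answer: str, reference: str) -> bool:
    """Toy external verifier: case-insensitive, whitespace-normalized equality."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(answer) == normalize(reference)


def outcome_reward(output: str, reference: str) -> float:
    """Reward depends only on the final answer; reasoning text is never scored."""
    answer = extract_final_answer(output)
    if answer is None:
        return 0.0  # malformed output: no answer span to verify
    return 1.0 if exact_match_verifier(answer, reference) else 0.0
```

A check like `exact_match_verifier` only works where answers can be compared mechanically, as in mathematics or coding. For free-form text-to-text tasks no such check exists, which is the gap the abstract says a verifier-free method is meant to close.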
