JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao

arXiv:2601.08468

Originality: Incremental advance

AI Analysis

The paper addresses efficiency and generalization issues in large language model reasoning and represents an incremental improvement over existing RLVR methods.

The paper tackles inefficient, verbose exploration in Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning in Large Language Models. It proposes JudgeRLVR, a two-stage judge-then-generate paradigm that improves accuracy while reducing generation length. Relative to Vanilla RLVR, it achieves about +3.7 points average accuracy gain with -42% average generation length on in-domain math, and about +4.5 points average accuracy improvement on out-of-domain benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
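
The two stages described above can be illustrated with a minimal sketch of the reward signals each stage might use. This is an assumption-laden illustration, not the authors' implementation: the abstract does not specify the reward design, the answer-matching rule, or the RL algorithm, and the names `normalize`, `is_correct`, `judge_reward`, and `generation_reward` are hypothetical. The sketch assumes a binary reward and exact string matching after light normalization.

```python
def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def is_correct(candidate_answer: str, reference_answer: str) -> bool:
    """Verifiable check (assumed here to be exact match after normalization)."""
    return normalize(candidate_answer) == normalize(reference_answer)


def judge_reward(verdict: bool, candidate_answer: str, reference_answer: str) -> float:
    """Stage 1 (judge): reward the model when its verdict on a candidate
    solution agrees with the label derived from the verifiable answer."""
    label = is_correct(candidate_answer, reference_answer)
    return 1.0 if verdict == label else 0.0


def generation_reward(final_answer: str, reference_answer: str) -> float:
    """Stage 2 (generate): reward on final-answer correctness only,
    as in plain RLVR; the policy would start from the stage-1 judge."""
    return 1.0 if is_correct(final_answer, reference_answer) else 0.0
```

In this reading, stage 1 scores the model's verdict on someone else's solution, while stage 2 scores the model's own final answer; both rewards come from the same verifiable check.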

