Reinforcement Learning with Conditional Expectation Reward

arXiv:2603.10624v114.81 citations · h-index: 5 · Has Code

Predicted impact: top 3% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work addresses the problem of applying reinforcement learning to general reasoning domains with free-form answers, which is relevant to both researchers and practitioners. It is an incremental advance: it builds on RLVR, replacing rule-based verifiers with a model-based approach.

The paper tackles a limitation of Reinforcement Learning with Verifiable Rewards (RLVR) by proposing Conditional Expectation Reward (CER), which uses the large language model itself as an implicit verifier to provide graded rewards. This removes the need for domain-specific rules, and the paper reports that CER is effective across mathematical and general reasoning tasks.

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
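
The abstract defines CER only in words, so the following is a minimal toy sketch of one possible reading, not the authors' implementation. It assumes a small discrete model in which a latent reasoning trace s is drawn from p(s) and an answer from p(a | s), and it computes the reward for a generated answer as the posterior expectation over s of p(reference | s). The function name, the latent-trace decomposition, and all probabilities are hypothetical and chosen for illustration; see the linked repository for the actual method.

```python
def conditional_expectation_reward(p_s, p_a_given_s, generated, reference):
    """Toy reward: E_{s ~ p(s | generated)}[ p(reference | s) ].

    p_s          -- list of prior probabilities over latent traces (assumed)
    p_a_given_s  -- list of dicts, one per trace, mapping answer -> probability
    """
    # Joint weight of each trace together with the generated answer.
    joint = [ps * pa[generated] for ps, pa in zip(p_s, p_a_given_s)]
    evidence = sum(joint)
    if evidence == 0.0:
        # The model never produces this answer; no conditional is defined.
        return 0.0
    # Bayes' rule: posterior over traces given the generated answer.
    posterior = [w / evidence for w in joint]
    # Expected likelihood of the reference answer under that posterior.
    return sum(post * pa[reference] for post, pa in zip(posterior, p_a_given_s))


# Hypothetical numbers: trace 0 mostly yields "1/2" or "0.5", trace 1 mostly "2".
p_s = [0.5, 0.5]
p_a_given_s = [
    {"1/2": 0.6, "0.5": 0.4, "2": 0.0},
    {"1/2": 0.0, "0.5": 0.1, "2": 0.9},
]

for answer in ["1/2", "0.5", "2"]:
    reward = conditional_expectation_reward(p_s, p_a_given_s, answer, "1/2")
    print(answer, round(reward, 3))
```

In this toy setup the exact match to the reference scores highest, the equivalent form "0.5" receives an intermediate reward, and the unrelated answer "2" receives zero. This illustrates the soft, graded signal the abstract contrasts with binary rule-based feedback.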
