CL · Dec 25, 2025

Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

arXiv:2512.21625v1 · 15 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work studies how to improve the reasoning abilities of large reasoning models and represents an incremental improvement in RLVR methods.

The paper examines how sample polarity, that is, whether a self-generated rollout is positive or negative, affects the training dynamics and behaviors of reinforcement learning with verifiable reward (RLVR). It finds that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. It then proposes A3PO, a token-level advantage shaping method that improves performance across five reasoning benchmarks.

Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
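
To make the notion of sample polarity concrete, the following is a minimal sketch of sample-level advantage adjustment. It is not A3PO and is not taken from the paper. It assumes binary verifiable rewards (1.0 for a correct rollout, 0.0 otherwise), a GRPO-style group-normalized advantage, and polarity read off the sign of that advantage. The function names and the `pos_scale` and `neg_scale` parameters are hypothetical and exist only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style baseline; the paper's exact estimator may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scale_by_polarity(advantages, pos_scale=1.0, neg_scale=1.0):
    """Rescale advantages asymmetrically by polarity.

    Hypothetical knobs: pos_scale applies to positive samples (advantage > 0),
    neg_scale to all other samples.
    """
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]


# Four rollouts for one prompt: two verified correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# Down-weight negative samples by half, leave positive samples unchanged.
shaped = scale_by_polarity(adv, pos_scale=1.0, neg_scale=0.5)
```

Under these assumptions, raising `pos_scale` relative to `neg_scale` puts more weight on reinforcing rollouts that are already correct, while raising `neg_scale` puts more weight on pushing probability away from incorrect ones. The abstract describes shaping at the token level as well, which this sample-level sketch does not attempt.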
