Pinpointing crucial steps: Attribution-based Credit Assignment for Verifiable Reinforcement Learning
This work addresses the challenge of balancing exploration and exploitation in RLVR for LLMs, with the goal of stronger complex reasoning. The contribution appears incremental, since it builds on existing RLVR methods.
The paper tackles inaccurate credit assignment and premature entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. It introduces ACPO, which targets both exploration and exploitation, and reports significant performance gains over state-of-the-art methods on benchmarks including AIME, MATH, and AMC.
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution-based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty-aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution-based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state-of-the-art approaches.
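To make the term "entropy collapse" concrete, the following minimal sketch computes the average per-token entropy of a policy from raw logits. It is not ACPO's entropy regulation mechanism, which the abstract does not specify; the function names `token_entropy` and `mean_policy_entropy` and the toy logit vectors are assumptions chosen for illustration only.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_logits):
    """Average per-token entropy over a sequence of logit vectors (hypothetical helper)."""
    return sum(token_entropy(row) for row in step_logits) / len(step_logits)

# Toy data (assumed): a policy that still explores vs. one that has collapsed.
exploring = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform over 4 tokens
collapsed = [[10.0, 0.0, 0.0, 0.0]] * 3  # nearly deterministic

print(mean_policy_entropy(exploring))
print(mean_policy_entropy(collapsed))
```

The uniform policy has entropy ln 4 per token, while the peaked one is close to zero. A sustained drop of this quantity early in training is what "premature entropy collapse" refers to.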