LG, AI | May 7

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

Chaoli Mou, Zhan Zhuang, Xinning Chen, Yu Zhang

arXiv:2605.05965 | 87.4

AI Analysis

For researchers improving reasoning in large language models via RLVR, S-trace offers a more efficient alternative to GRPO by selectively assigning credit to critical reasoning steps.

RLVR algorithms like GRPO suffer from uniform credit assignment, hindering learning efficiency. The proposed Selective Eligibility Traces (S-trace) achieves fine-grained credit assignment, outperforming GRPO with gains of 0.49% on Qwen3-1.7B, 3.16% on Qwen3-4B, and 2.98% on Qwen3-8B in average pass@16, while improving sample and token efficiency.

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) rely on a "uniform credit assignment" assumption that indiscriminately broadcasts trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we first introduce P-trace, a sample-efficient, critic-free eligibility traces method. We then build S-trace on top of it: a sparse eligibility traces mechanism that further mitigates variance and achieves fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we place the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace outperforms GRPO in average pass@16, with gains of 0.49% on Qwen3-1.7B, 3.16% on Qwen3-4B, and a robust 2.98% when scaled further to Qwen3-8B, while also achieving higher sample and token efficiency.
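To make the contrast between uniform and selective credit assignment concrete, here is a minimal toy sketch in plain Python. It is not the paper's P-trace or S-trace algorithm: the abstract does not specify the trace computation, so the sketch only illustrates (a) a GRPO-style group-standardized advantage broadcast to every token and (b) a hypothetical masking rule that zeroes the credit of low-entropy tokens. The function names, the entropy threshold of 0.5 nats, the use of population standard deviation, and the example distributions are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group.

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def uniform_credit(advantage, num_tokens):
    """Uniform credit assignment: every token receives the same advantage."""
    return [advantage] * num_tokens


def entropy_masked_credit(advantage, entropies, threshold):
    """Hypothetical selective rule: tokens below the entropy threshold get zero credit."""
    return [advantage if h >= threshold else 0.0 for h in entropies]


# Four sampled responses to one prompt, with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Made-up next-token distributions for a three-token response.
step_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic token (low entropy)
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain token (high entropy)
    [0.50, 0.50, 0.00, 0.00],  # two-way choice
]
entropies = [token_entropy(p) for p in step_distributions]

print(uniform_credit(advantages[0], len(entropies)))
print(entropy_masked_credit(advantages[0], entropies, threshold=0.5))
```

Under the uniform rule all three tokens of the first response share one advantage value. Under the masked rule only the two higher-entropy tokens keep it, which is the kind of sparsity the abstract attributes to S-trace.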
