Discovering Process-Outcome Credit in Multi-Step LLM Reasoning
The work addresses inefficiencies in training LLMs for reasoning tasks and reports improved performance and robustness, though the contribution is incremental because it builds on existing RL and CoT methods.
The paper tackles reward sparsity and inefficient credit assignment in reinforcement learning for reasoning in large language models. It proposes a framework that combines step-wise marginal information gain with decoupled masking, and it reports consistently better sample efficiency and accuracy than baselines on benchmarks such as MATH and Super-CLEVR.
Reinforcement Learning (RL) is a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals. It introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of each reasoning step against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy that applies process-oriented rewards only to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, with promising zero-shot transfer to unseen and challenging reasoning tasks.
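The abstract does not give formulas for the MIG mechanism or the Decoupled Masking Strategy, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each reasoning step already has a scalar score (for instance, some estimate of how likely the correct answer is after that step), that the watermark is the running maximum of earlier scores, and that masks are 0/1 lists over completion tokens. The function names, the `initial_watermark` parameter, and the additive combination of rewards are all hypothetical.

```python
from typing import List


def marginal_information_gain(
    step_scores: List[float], initial_watermark: float = 0.0
) -> List[float]:
    """Assumed reading of MIG: reward a step only for the amount by which
    its score exceeds the best score seen so far (the watermark)."""
    watermark = initial_watermark
    gains = []
    for score in step_scores:
        # Steps that fail to beat the watermark earn nothing.
        gains.append(max(0.0, score - watermark))
        # The watermark is monotonic: it never decreases.
        watermark = max(watermark, score)
    return gains


def decoupled_token_rewards(
    cot_mask: List[int],
    completion_mask: List[int],
    process_reward: float,
    outcome_reward: float,
) -> List[float]:
    """Assumed reading of decoupled masking: the process reward reaches only
    CoT tokens, the outcome reward reaches every completion token."""
    return [
        process_reward * c + outcome_reward * m
        for c, m in zip(cot_mask, completion_mask)
    ]


# Hypothetical step scores: the third step regresses below the watermark.
gains = marginal_information_gain([0.25, 0.5, 0.375, 1.0])

# Three CoT tokens followed by two answer tokens.
token_rewards = decoupled_token_rewards(
    cot_mask=[1, 1, 1, 0, 0],
    completion_mask=[1, 1, 1, 1, 1],
    process_reward=0.5,
    outcome_reward=1.0,
)
```

Under these assumptions, a step that regresses (the third one above) receives zero gain rather than a negative reward, which is one way the watermark could filter noisy steps, and the answer tokens carry only the outcome reward.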