Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Fei Ding, Yongkang Zhang, Yeling Peng, Youwei Wang, Guoxiong Zhou, Zijian Zeng

arXiv:2605.16302

Predicted impact: top 15% in LG (last 90 days) · Originality: incremental advance

AI Analysis

For researchers training LLMs on multi-step reasoning tasks, this work offers a method to stabilize training and improve performance by addressing the credit assignment problem.

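The paper addresses credit assignment in reinforcement learning for LLM multi-step reasoning, where sparse terminal rewards cause high gradient variance and unstable training. The proposed Implicit Behavior Policy Optimization (IBPO) framework compares multiple reasoning trajectories sampled for the same input to derive step-sensitive learning signals. The authors report improved training stability and higher performance upper bounds on math and code reasoning benchmarks.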

Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards. This makes credit assignment difficult, because the final feedback is propagated evenly across all intermediate decisions. The result is high gradient variance, unstable training, and many ineffective updates, which ultimately cause the model to fail and prevent sustained improvement. We introduce a counterfactual, comparison-based credit assignment framework that samples multiple reasoning trajectories for the same input. By treating the differences between these trajectories as an implicit approximation of alternative decisions, we construct an implicit process-level advantage estimator that transforms sparse terminal rewards into step-sensitive learning signals. Building on this estimator, we propose Implicit Behavior Policy Optimization (IBPO), which significantly improves training stability and performance upper bounds on mathematical and code reasoning benchmarks, pointing to a promising direction for unlocking the performance potential of LLMs.
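
To make the idea of turning one terminal reward into step-level signals concrete, here is a minimal Python sketch. It is not the paper's IBPO estimator, whose details the abstract does not give. It assumes that each trajectory is a list of discrete step labels, that steps can be compared by exact match, and that the value of a shared prefix can be approximated by the mean terminal reward of the sampled trajectories that pass through it. The function names `prefix_values` and `step_advantages` are hypothetical.

```python
from collections import defaultdict


def prefix_values(trajectories, rewards):
    """Mean terminal reward over all sampled trajectories sharing each step prefix."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, reward in zip(trajectories, rewards):
        # Every prefix of a trajectory, including the empty one, inherits its reward.
        for t in range(len(steps) + 1):
            prefix = tuple(steps[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


def step_advantages(trajectories, rewards):
    """Per-step advantage: change in estimated prefix value caused by taking that step.

    Assumption for illustration only: sibling trajectories that diverge at a
    step act as the counterfactual alternatives for that step.
    """
    values = prefix_values(trajectories, rewards)
    return [
        [values[tuple(steps[: t + 1])] - values[tuple(steps[:t])]
         for t in range(len(steps))]
        for steps in trajectories
    ]


# Four hypothetical reasoning paths sampled for the same prompt,
# each scored only by a terminal reward (1.0 = correct, 0.0 = wrong).
trajectories = [
    ["expand", "factor", "solve"],
    ["expand", "factor", "guess"],
    ["expand", "divide", "solve"],
    ["substitute", "solve"],
]
rewards = [1.0, 0.0, 0.0, 1.0]

advs = step_advantages(trajectories, rewards)
for steps, adv in zip(trajectories, advs):
    print(steps, [round(a, 3) for a in adv])
```

In this toy setup the per-step advantages of a trajectory sum to its terminal reward minus the group mean reward, so the total signal matches a group-relative baseline while credit concentrates on the steps where the sampled paths diverge. Real reasoning traces rarely share exact prefixes, so a practical method would need a softer notion of step similarity.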
