AI · May 11

Verifiable Process Rewards for Agentic Reasoning

Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu

arXiv:2605.10325 · 84.2

Predicted impact: top 29% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For researchers working on reinforcement learning for LLM agents, VPR provides a method to improve credit assignment in tasks with verifiable intermediate actions, though it depends on oracle quality and is limited to structured environments.

The paper tackles credit assignment in long-horizon agentic reasoning for LLMs by introducing Verifiable Process Rewards (VPR), which uses dense turn-level supervision from symbolic or algorithmic oracles. VPR outperforms outcome-level and rollout-based process reward baselines in controlled environments and transfers to general and agentic reasoning benchmarks.

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
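
The contrast the abstract draws between sparse outcome-level feedback and dense turn-level supervision can be illustrated with a small sketch. This is not the authors' implementation: the toy "all-different" constraint, the `Oracle` type, and the function names `outcome_rewards`, `turn_rewards`, and `returns_to_go` are all hypothetical, chosen only to show how a per-action verifier might localize credit within a trajectory.

```python
from typing import Callable, List, Sequence

# Hypothetical oracle signature: (actions taken so far, new action) -> is it valid?
Oracle = Callable[[Sequence[int], int], bool]


def all_different_oracle(state: Sequence[int], action: int) -> bool:
    """Toy constraint-based verifier: an action is valid if it was not used before."""
    return action not in state


def outcome_rewards(actions: Sequence[int], oracle: Oracle) -> List[float]:
    """Sparse outcome-level reward: only the last turn carries a signal,
    1.0 if every action in the trajectory was valid, else 0.0."""
    if not actions:
        return []
    state: List[int] = []
    all_valid = True
    for action in actions:
        all_valid = all_valid and oracle(state, action)
        state.append(action)
    return [0.0] * (len(actions) - 1) + [1.0 if all_valid else 0.0]


def turn_rewards(actions: Sequence[int], oracle: Oracle) -> List[float]:
    """Dense turn-level reward: each action is checked by the oracle when taken."""
    state: List[int] = []
    rewards: List[float] = []
    for action in actions:
        rewards.append(1.0 if oracle(state, action) else 0.0)
        state.append(action)
    return rewards


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each turn to the end of the trajectory."""
    out: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]


if __name__ == "__main__":
    trajectory = [3, 1, 3, 2]  # the third action repeats 3 and violates the constraint
    print(outcome_rewards(trajectory, all_different_oracle))
    print(turn_rewards(trajectory, all_different_oracle))
    print(returns_to_go(turn_rewards(trajectory, all_different_oracle)))
```

In this toy trajectory the outcome-level signal is zero on every turn, so the three valid actions are indistinguishable from the invalid one, whereas the turn-level signal isolates the single flawed action. How such rewards are combined with a policy-gradient objective is left out here, since the abstract does not specify it.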
