AI · May 28

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

arXiv:2605.29697 · 92.2
Predicted impact: top 16% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in agentic search and reinforcement learning, this provides a novel method to assign credit to individual steps without costly tree sampling, improving training efficiency.

The paper tackles step-level credit assignment in Agentic Search, where trajectory-level rewards are insufficient. It proposes GDCR, a step-level reward based on graph distance to the answer node, and SAPO, which combines step-level and trajectory-level advantages, achieving strong results on four benchmarks.

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as a search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly retrieved and newly cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
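To make the idea concrete, here is a minimal Python sketch of a graph-distance step reward and of one way to mix step-level and trajectory-level advantages. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the scoring formula or the combination rule, so the `1 / (1 + distance)` score, taking the best new entity per step, the within-trajectory normalization, and the `weight` parameter are all hypothetical choices, as are the function names and the toy graph.

```python
from collections import deque
from statistics import mean, pstdev


def bfs_distances(graph, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(steps, dist):
    """Score each step by the entities it introduces for the first time.

    Assumed scoring (not from the paper): 1 / (1 + distance to the answer)
    for the closest new entity; 0.0 if the step adds nothing new or only
    entities that are disconnected from the answer node.
    """
    seen = set()
    rewards = []
    for entities in steps:
        new = [e for e in entities if e not in seen]
        seen.update(new)
        scores = [1.0 / (1 + dist[e]) for e in new if e in dist]
        rewards.append(max(scores, default=0.0))
    return rewards


def normalize(values):
    """Zero-mean, unit-variance scaling; all zeros if the values are constant."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def combined_advantages(rewards, trajectory_advantage, weight=0.5):
    """Assumed combination: trajectory advantage plus weighted step advantage."""
    return [trajectory_advantage + weight * a for a in normalize(rewards)]


# Toy entity-relation graph, stored as undirected adjacency lists.
graph = {
    "question_entity": ["bridge"],
    "bridge": ["question_entity", "answer"],
    "answer": ["bridge"],
    "distractor": [],
}
dist = bfs_distances(graph, "answer")

# Entities retrieved or cited at each step of one trajectory.
steps = [["question_entity"], ["distractor"], ["bridge", "question_entity"], ["answer"]]
rewards = step_rewards(steps, dist)
advantages = combined_advantages(rewards, trajectory_advantage=1.0)
```

In this toy trajectory the step that surfaces only the disconnected `distractor` entity gets a reward of zero, while steps that move closer to the answer node get progressively larger rewards, so the per-step advantages differ even though all steps share the same trajectory-level outcome.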

