AI · Jun 4

TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

arXiv:2606.05784
AI Analysis

For researchers working on reinforcement learning for tool-augmented multimodal search agents, this paper addresses a systematic failure mode (credit misassignment) with a lightweight, plug-and-play solution.

The paper identifies credit misassignment in GRPO for tool-augmented multimodal search agents: valuable tool-use steps in failing trajectories are penalized exactly as heavily as valueless ones. It proposes TAPO, which corrects this with counterfactual witnesses and confidence-gated conservative advantage correction, and reports consistent improvements over three RL baselines (GRPO, GSPO, and SAPO) across multiple benchmarks.
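The uniform broadcast behind this failure mode can be illustrated with a minimal sketch. It assumes the common GRPO formulation, in which each trajectory's reward is normalized by the group mean and standard deviation and the resulting scalar is copied to every token. The function name, the use of the population standard deviation, and the toy rewards are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Group-normalize trajectory rewards, then copy each trajectory's
    advantage onto every one of its tokens (the uniform broadcast)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of 4 rollouts for one query: 1 success, 3 failures.
rewards = [1.0, 0.0, 0.0, 0.0]
token_counts = [5, 4, 6, 3]
token_adv = grpo_token_advantages(rewards, token_counts)

# Every token of a failing rollout gets the same negative value, so a
# useful tool call inside it is indistinguishable from a useless one.
failing = token_adv[1]
```

In this toy group, all tokens of the three failing rollouts receive one shared negative advantage, regardless of whether a given tool call retrieved useful information.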

We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
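The abstract does not give TAPO's formulas, so the following is only a rough sketch of one way the idea could be read. It assumes a "witness" is a tool call with similar parameters that appears in a positively rewarded trajectory of the same batch, that confidence is the best string similarity to any such witness, and that the correction shrinks the negative advantage without ever flipping its sign. `ToolStep`, `similarity`, `correct_advantages`, `gate`, and `alpha` are all hypothetical names and parameters, and `difflib.SequenceMatcher` is a stand-in for whatever similarity measure the authors use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ToolStep:
    traj_id: int
    params: str       # serialized tool-call parameters, e.g. a search query
    advantage: float  # trajectory-level advantage broadcast to this step


def similarity(a: str, b: str) -> float:
    """Stand-in parameter similarity in [0, 1] (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def correct_advantages(steps, gate=0.8, alpha=0.5):
    """Shrink negative credit for tool calls that have a close witness
    in a successful trajectory; leave everything else unchanged."""
    witnesses = [s for s in steps if s.advantage > 0]
    corrected = []
    for s in steps:
        adv = s.advantage
        if adv < 0 and witnesses:
            conf = max(similarity(s.params, w.params) for w in witnesses)
            if conf >= gate:               # confidence gate
                adv *= 1.0 - alpha * conf  # conservative: sign never flips
        corrected.append(adv)
    return corrected


# Hypothetical batch: one successful call, one identical call in a failing
# trajectory, and one unrelated call in a failing trajectory.
steps = [
    ToolStep(0, "image_search: eiffel tower height", 1.7),
    ToolStep(1, "image_search: eiffel tower height", -0.6),
    ToolStep(2, "text_search: weather tomorrow", -0.6),
]
corrected = correct_advantages(steps)
```

Under these assumptions, the failing call that matches a successful witness has its penalty halved, while the unrelated failing call keeps its full penalty. Because the sketch only reuses steps already in the batch, it needs no extra annotation, models, or sampling, which mirrors the property the abstract claims.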

