TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents
For researchers working on reinforcement learning for tool-augmented multimodal search agents, this paper addresses a systematic failure mode (credit misassignment) with a lightweight, plug-and-play solution.
The paper identifies credit misassignment in GRPO for tool-augmented multimodal search agents, where valuable tool-use steps in failing trajectories are penalized exactly as heavily as valueless ones. It proposes TAPO, which corrects this via counterfactual witnesses and confidence-gated conservative advantage correction, achieving consistent improvements over strong baselines for three RL algorithms (GRPO, GSPO, and SAPO) across multiple multimodal search benchmarks.
We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). Our code and models will be publicly released upon acceptance.
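To make the mechanism concrete, the following is a minimal sketch and not the authors' implementation, since the abstract gives no formulas. It assumes the standard GRPO group-normalized advantage, broadcast unchanged to every step of a trajectory, and then applies an illustrative correction: a tool call in a failing trajectory whose parameters closely match a call from a succeeding trajectory in the same batch (a "witness") has its negative advantage shrunk toward zero. The Jaccard similarity, the `sim_threshold` gate, and the `scale` factor are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: one scalar per trajectory, (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def jaccard(a, b):
    """Token-set overlap between two tool-call parameter strings.

    Hypothetical similarity measure; the paper's actual measure is not stated here.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def corrected_step_advantages(trajectories, rewards, sim_threshold=0.8, scale=0.5):
    """Return per-step advantages for each trajectory.

    trajectories: list of rollouts, each a list of tool-call parameter strings.
    rewards:      one scalar reward per rollout.
    sim_threshold, scale: assumed hyperparameters (scale expected in [0, 1]).
    """
    adv = grpo_advantages(rewards)
    # Calls from trajectories with positive advantage act as candidate witnesses.
    witnesses = [c for calls, a in zip(trajectories, adv) if a > 0 for c in calls]

    corrected = []
    for calls, a in zip(trajectories, adv):
        steps = []
        for call in calls:
            step_adv = a  # GRPO default: every step inherits the trajectory advantage
            if a < 0 and witnesses:
                conf = max(jaccard(call, w) for w in witnesses)
                if conf >= sim_threshold:  # confidence gate
                    # Conservative: shrink the penalty toward zero, never flip its sign.
                    step_adv = a * (1.0 - scale * conf)
            steps.append(step_adv)
        corrected.append(steps)
    return corrected


if __name__ == "__main__":
    rollouts = [
        ["search eiffel tower height"],                         # succeeds
        ["search eiffel tower height", "search banana price"],  # fails
    ]
    for steps in corrected_step_advantages(rollouts, rewards=[1.0, 0.0]):
        print(steps)
```

In this toy batch, the failing rollout's first call matches the successful rollout's call exactly, so its penalty is halved, while the unrelated second call keeps the full negative advantage. The sketch uses only quantities already present in the batch, in keeping with the abstract's claim that no extra annotation, models, or sampling are needed.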