CL AI MA
Sep 17, 2025

Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents

Weiting Tan, Xinghua Qu, Ming Tu, Meng Ge, Andy T. Liu, Philipp Koehn, Lu Lu

arXiv:2509.14480v1
13.98 citations
h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses the challenge of developing more natural, voice-driven interactive agents for multimodal tool use, representing an incremental advance in agent training methods.

The paper tackles the problem of training agents for interactive multimodal tool use. It introduces a sandbox environment and a reinforcement learning strategy called TARL (Turn-level Adjudicated Reinforcement Learning), which raises the task pass rate on the text-based τ-bench by over 6% compared to strong RL baselines. It also shows that the framework can fine-tune a multimodal foundation model for agentic tasks.

Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
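
The turn-level credit assignment described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how judge scores are combined with the task outcome, so everything here is an assumption made for illustration: the `Turn` type, the `turn_level_rewards` function, the `turn_weight` blending parameter, and the `toy_judge` stand-in (a real setup would call an LLM judge) are hypothetical and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str


def turn_level_rewards(
    turns: List[Turn],
    judge: Callable[[List[Turn], Turn], float],
    final_reward: float,
    turn_weight: float = 0.5,
) -> List[float]:
    """Return one reward per assistant turn.

    Assumption: each reward blends a judge score for that turn with the
    trajectory-level outcome. The paper may combine them differently.
    """
    rewards = []
    for i, turn in enumerate(turns):
        if turn.role != "assistant":
            continue
        # The judge sees the dialogue history up to this turn, plus the turn.
        score = judge(turns[:i], turn)
        rewards.append(turn_weight * score + (1.0 - turn_weight) * final_reward)
    return rewards


def toy_judge(history: List[Turn], turn: Turn) -> float:
    """Hypothetical stand-in for an LLM judge: favors turns that call a tool."""
    return 1.0 if "CALL" in turn.content else -1.0


dialogue = [
    Turn("user", "Where is my order?"),
    Turn("assistant", "CALL lookup_order(id=42)"),
    Turn("tool", "status: shipped"),
    Turn("assistant", "Your order has shipped."),
]
rewards = turn_level_rewards(dialogue, toy_judge, final_reward=1.0)
print(rewards)
```

With a single trajectory-level reward, both assistant turns would receive the same signal. Here each turn gets its own value, which is the point of turn-level evaluation for long-horizon credit assignment.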

