AI · Nov 17, 2025

Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue

arXiv:2511.13288v220.212 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses the need for more accurate and scalable multi-agent systems in tool-augmented reasoning tasks. It is incremental, however, because it builds on existing GRPO methods.

The paper tackles the problem of training multi-agent systems in which each agent is a distinct large language model. Such systems face optimization challenges, including agents that act at different frequencies and disrupted gradient flow. The proposed M-GRPO is a hierarchical extension of Group Relative Policy Optimization that combines group-relative advantages with trajectory alignment. It outperforms baselines on benchmarks such as GAIA, with improved stability and sample efficiency.

Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit performance, because different agents face different underlying data distributions. Training multi-agent systems with distinct LLMs is therefore a natural next step. However, this approach introduces optimization challenges: agents operate at different frequencies, rollouts involve varying numbers of sub-agent invocations, and agents are often deployed across separate servers, which disrupts end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents improves performance on tool-augmented reasoning tasks.
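
The two mechanisms named in the abstract, group-relative advantages and trajectory alignment, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The advantage function uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation). The alignment rule (randomly dropping surplus sub-agent trajectories, or duplicating existing ones to reach a fixed count) is an assumption made for illustration, since the abstract only states that batches are fixed-size. The names `group_relative_advantages`, `align_sub_trajectories`, `target`, and `eps` are hypothetical.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def align_sub_trajectories(trajs, target, rng):
    """Return exactly `target` trajectories from a variable-length list.

    Assumed rule (not taken from the paper's text): sample down without
    replacement when there are too many, duplicate random ones when too few.
    """
    if not trajs:
        raise ValueError("rollout has no sub-agent trajectories to align")
    if len(trajs) >= target:
        return rng.sample(trajs, target)
    padding = [rng.choice(trajs) for _ in range(target - len(trajs))]
    return trajs + padding


rng = random.Random(0)

# One group of four main-agent rollouts for the same query.
main_rewards = [1.0, 0.0, 0.0, 1.0]
main_adv = group_relative_advantages(main_rewards)

# This rollout invoked the sub-agent three times; the batch expects four.
sub_trajs = ["search-1", "search-2", "browse-1"]
aligned = align_sub_trajectories(sub_trajs, target=4, rng=rng)
trimmed = align_sub_trajectories(sub_trajs, target=2, rng=rng)
```

Under this sketch, every rollout contributes the same number of sub-agent trajectories, so the sub-agent's trainer can form batches of a constant shape regardless of how often the main agent delegated.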
