LG · Feb 10

Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, Qiong Chen, Xinyu Liu

arXiv:2602.09578v1 · 2.71 citations · h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses system-level bottlenecks faced by researchers and practitioners deploying large-scale LLM-based MARL. It is an incremental advance: it builds on existing MARL concepts and focuses on infrastructure optimization.

The paper tackles the inefficiency of existing training frameworks for large-scale multi-agent reinforcement learning (MARL) by proposing FlexMARL, an end-to-end framework that optimizes rollout, training, and their orchestration. In empirical tests it achieves up to 7.3x speedup and up to 5.6x higher hardware utilization than existing frameworks.

Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces a joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon an experience store, a novel micro-batch-driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. The rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter-agent and intra-agent request patterns. The training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified, location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
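
To make the idea of a micro-batch-driven pipeline over an experience store more concrete, here is a minimal Python sketch. It is not FlexMARL's implementation: the names `ExperienceStore`, `Sample`, `rollout_worker`, and `run_pipeline` are hypothetical, threads stand in for disaggregated rollout and training engines, and the consistency guarantees described in the abstract are not modeled. It only illustrates how a trainer can start consuming micro-batches as soon as samples arrive, rather than waiting for every agent's rollout to finish.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Sample:
    agent_id: int
    policy_version: int  # staleness tag only; this sketch does not enforce consistency
    reward: float


class ExperienceStore:
    """Hypothetical thread-safe buffer between rollout workers and the trainer."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, sample):
        self._q.put(sample)

    def close(self):
        # Sentinel telling the consumer that no more samples will arrive.
        self._q.put(None)

    def micro_batches(self, size):
        """Yield micro-batches as soon as `size` samples are available."""
        batch = []
        while True:
            item = self._q.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:  # flush the final, possibly smaller, micro-batch
            yield batch


def rollout_worker(store, agent_id, n_samples, policy_version):
    # Stand-in for one agent's rollout; a real system would run LLM inference here.
    for i in range(n_samples):
        store.put(Sample(agent_id, policy_version, reward=float(i)))


def run_pipeline(n_agents=3, samples_per_agent=4, micro_batch_size=5):
    store = ExperienceStore()
    consumed_sizes = []

    def trainer():
        # Stand-in for a training step: record the size of each micro-batch.
        for micro_batch in store.micro_batches(micro_batch_size):
            consumed_sizes.append(len(micro_batch))

    trainer_thread = threading.Thread(target=trainer)
    workers = [
        threading.Thread(target=rollout_worker, args=(store, a, samples_per_agent, 0))
        for a in range(n_agents)
    ]

    trainer_thread.start()  # training overlaps with rollout, no global barrier
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    store.close()
    trainer_thread.join()
    return consumed_sizes


if __name__ == "__main__":
    print(run_pipeline())
```

With the default arguments, 12 samples are produced and consumed in micro-batches of at most 5. The order in which samples from different agents are interleaved depends on thread scheduling; only the micro-batch sizes are deterministic.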
