DC AI LG | Dec 27, 2025

RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang

arXiv:2512.22560v1 | 8.09 citations | h-index: 16 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses infrastructure bottlenecks for large-scale agentic RL training, enabling faster development of autonomous AI agents, though it is incremental as it builds on existing disaggregation concepts.

The paper tackles the challenge of efficiently training agentic reinforcement learning (RL) systems, whose workloads are heterogeneous. It proposes RollArt, a distributed system that uses disaggregated infrastructure to improve throughput, reducing end-to-end training time by 1.35-2.05× compared to baselines.

Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArt, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArt is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArt effectively improves training throughput and achieves a 1.35-2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArt by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArt's scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
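
The second principle in the abstract, fine-grained asynchrony at the trajectory level, can be illustrated with a toy timing model. The sketch below is not taken from the paper or from the ROLL codebase. The `Trajectory` class, both function names, and all durations are hypothetical, and it assumes enough workers that every trajectory runs in parallel. It compares a stage barrier (scoring waits for the slowest rollout) with per-trajectory hand-off (each trajectory is scored as soon as its own rollout ends).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical per-trajectory costs in seconds (illustrative only).
    rollout_s: float  # time to generate the trajectory
    reward_s: float   # time to score it with a reward model


def batch_synchronous_makespan(trajs):
    # Stage barrier: no trajectory is scored until the slowest rollout
    # has finished, so fast rollouts sit idle in a "bubble".
    return max(t.rollout_s for t in trajs) + max(t.reward_s for t in trajs)


def trajectory_level_makespan(trajs):
    # No barrier: each trajectory moves to scoring as soon as its own
    # rollout finishes, so the slowest single trajectory sets the pace.
    return max(t.rollout_s + t.reward_s for t in trajs)


# Made-up workload: one long rollout and two short ones.
trajs = [Trajectory(10.0, 1.0), Trajectory(2.0, 6.0), Trajectory(4.0, 3.0)]

sync_time = batch_synchronous_makespan(trajs)
async_time = trajectory_level_makespan(trajs)
print(f"batch-synchronous: {sync_time:.1f}s, trajectory-level: {async_time:.1f}s")
```

In this model the trajectory-level schedule can never be slower than the barrier schedule, since the maximum of the sums is at most the sum of the maxima. The gap between the two is the idle time that a stage barrier introduces when rollout lengths vary.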
