Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms
Echo addresses a bottleneck in large-scale RL alignment for LLMs and suggests that datacentre-grade performance is achievable with decentralised resources, though the contribution is incremental in that it optimises existing workflows.
The paper tackles inefficient serial context switching between inference and training in RL-based post-training for large language models by decoupling the two phases across heterogeneous swarms. The resulting system matches a co-located baseline in convergence speed and final reward while offloading trajectory generation to edge hardware.
Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context switching violates the single-program-multiple-data (SPMD) assumption underlying today's distributed training systems. We present Echo, an RL system that cleanly decouples these two phases across heterogeneous "inference" and "training" swarms while preserving statistical efficiency. Echo introduces two lightweight synchronization protocols: a sequential pull mode that refreshes policy weights in response to API calls for minimal bias, and an asynchronous push-pull mode that streams version-tagged rollouts through a replay buffer to maximise hardware utilisation. Training four representative RL workloads with Qwen3-4B, Qwen2.5-7B, Qwen3-30B-A3B-Thinking-2507 and Qwen3-32B on a geographically distributed cluster, Echo matches a fully co-located Verl baseline in convergence speed and final reward while offloading trajectory generation to commodity edge hardware. These promising results demonstrate that large-scale RL for LLMs could achieve datacentre-grade performance using decentralised, heterogeneous resources.
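The two synchronization modes described above can be illustrated with a minimal Python sketch. It is not Echo's implementation: the names `Rollout`, `VersionedReplayBuffer`, `InferenceWorker`, and `sequential_pull_step` are hypothetical, and the `max_staleness` cutoff is an assumption made for illustration, since the abstract does not say how stale rollouts are handled. The sketch only shows the general idea that rollouts carry the policy version that produced them, so a trainer can tell on-policy data from stale data.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    """A trajectory tagged with the policy version that generated it."""
    policy_version: int
    tokens: list
    reward: float


class VersionedReplayBuffer:
    """FIFO buffer for the asynchronous push-pull mode (hypothetical).

    Assumption: rollouts more than `max_staleness` versions behind the
    trainer are discarded. The source does not specify this rule.
    """

    def __init__(self, capacity: int, max_staleness: int):
        self.max_staleness = max_staleness
        self._items = deque(maxlen=capacity)

    def push(self, rollout: Rollout) -> None:
        # Inference workers push without waiting for the trainer.
        self._items.append(rollout)

    def sample(self, batch_size: int, current_version: int) -> list:
        """Pop up to `batch_size` rollouts that are fresh enough to train on."""
        batch = []
        while self._items and len(batch) < batch_size:
            rollout = self._items.popleft()
            if current_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
        return batch

    def __len__(self) -> int:
        return len(self._items)


class InferenceWorker:
    """Stand-in for an inference node; generation is faked for illustration."""

    def __init__(self):
        self.version = -1

    def pull_weights(self, trainer_version: int) -> None:
        # A real system would transfer weights; here only the tag is tracked.
        self.version = trainer_version

    def generate(self, prompt_id: int) -> Rollout:
        return Rollout(self.version, [prompt_id], reward=0.0)


def sequential_pull_step(worker, trainer_version, prompt_ids):
    """Sequential pull mode: refresh weights first, so rollouts are on-policy."""
    worker.pull_weights(trainer_version)
    return [worker.generate(p) for p in prompt_ids]
```

In the sequential sketch every rollout carries the trainer's current version, which corresponds to the "minimal bias" case. In the asynchronous sketch the buffer may hold rollouts from several versions at once, trading some staleness for higher hardware utilisation.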