DC · Apr 11

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

Yongjun He, Shuai Zhang, Jiading Gai, Xiyuan Zhang, Boran Han, Bernie Wang, Huzefa Rangwala, George Karypis

Amazon

arXiv:2512.12476 · 92.32 citations · h-index: 99

Predicted impact: top 1% in DC · last 90 days · Originality: Incremental advance

AI Analysis

This work provides a practical solution for organizations to utilize heterogeneous GPU clusters for LLM post-training, reducing reliance on scarce homogeneous high-end GPUs.

HetRL addresses the challenge of efficient reinforcement learning for LLMs in heterogeneous GPU environments, achieving up to 9.17x the throughput of state-of-the-art systems and 3.17x on average across diverse workloads.

As large language models (LLMs) continue to scale and new GPUs are released even more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or previous-generation GPUs and alleviate the shortage of homogeneous high-end GPUs within a single availability zone. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training in infrastructures with heterogeneous GPUs and networks. HetRL formulates the scheduling of RL training in heterogeneous environments as a constrained joint optimization problem and provides two complementary approaches for addressing this problem: (1) a hybrid scheduling algorithm that efficiently identifies near-optimal solutions, and (2) an integer linear programming (ILP)-based scheduling algorithm that obtains optimal solutions, enabling flexible trade-offs between solution optimality and efficiency. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL achieves up to 9.17x the throughput of state-of-the-art systems, and 3.17x on average, across a wide range of workloads and settings.
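To make the idea of scheduling as a constrained optimization problem concrete, here is a minimal toy sketch. It is not HetRL's hybrid or ILP algorithm: it simply enumerates every mapping of RL tasks to GPU pools and keeps the feasible one with the shortest completion time. The task names, work units, memory sizes, and pool speeds are all hypothetical, and the cost model (tasks on different pools run in parallel, tasks on the same pool run back to back, no communication cost) is an assumption made for illustration only.

```python
from itertools import product

# Hypothetical workload: (name, work in arbitrary units, memory needed in GB).
TASKS = [("rollout", 60.0, 40), ("reward", 20.0, 20), ("train", 90.0, 70)]
# Hypothetical GPU pools: name -> (relative speed, memory per device in GB).
POOLS = {"new": (3.0, 80), "old": (1.0, 40)}


def best_assignment(tasks, pools):
    """Exhaustively search task -> pool mappings and minimise the makespan."""
    best = None
    names = list(pools)
    for choice in product(names, repeat=len(tasks)):
        # Constraint: every task must fit in the memory of its assigned pool.
        if any(mem > pools[p][1] for (_, _, mem), p in zip(tasks, choice)):
            continue
        # Assumed cost model: tasks sharing a pool run sequentially.
        load = {p: 0.0 for p in names}
        for (_, work, _), p in zip(tasks, choice):
            load[p] += work / pools[p][0]
        # Pools run in parallel, so the slowest pool sets the iteration time.
        makespan = max(load.values())
        if best is None or makespan < best[0]:
            best = (makespan, dict(zip((t[0] for t in tasks), choice)))
    return best


makespan, plan = best_assignment(TASKS, POOLS)
print(makespan, plan)
```

Exhaustive search grows exponentially with the number of tasks, which is one reason a real system would lean on heuristics or an ILP solver. The sketch only shows how a memory constraint and an objective combine in the search.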
