DC | AI | Feb 19, 2025

Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs

arXiv:2502.13480v1 | 1 citation
Originality: Incremental advance
AI Analysis

This work addresses the challenge of optimizing training performance and cost for machine learning practitioners using heterogeneous GPU setups, representing an incremental improvement in automated parallelization tools.

The paper tackles automatic search for efficient, cost-effective parallel strategies on heterogeneous GPUs. It introduces Astra, which achieves better throughput than expert-designed strategies. Average search time is 1.27 seconds in the single-GPU setting and under 1.35 minutes in the heterogeneous-GPU setting, with over 95% accuracy.

In this paper, we introduce Astra, an efficient and money-saving framework for automatic parallel strategy search on heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy across both the GPU configuration search space (GPU types and GPU counts) and the parallel parameter search space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to propose automatic parallel strategy search that targets money-saving. Experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. On average, Astra's search takes 1.27 seconds in a single-GPU setting and less than 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.
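
To make the two search spaces concrete, here is a minimal sketch of a joint search over GPU configurations (type and count) and parallel parameters, with either step time or rental cost as the objective. It is not Astra's algorithm or cost model: the names `GpuType`, `Strategy`, `fits`, `step_time`, `cost_usd`, and `search`, the tensor/pipeline/data-parallel parameterization, the penalty coefficients, and all GPU numbers are hypothetical and chosen only for illustration. It also covers only homogeneous clusters of each GPU type, not the mixed-type clusters the paper models.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GpuType:
    name: str
    tflops: float          # hypothetical sustained throughput per GPU
    mem_gb: float          # hypothetical memory per GPU
    price_per_hour: float  # hypothetical rental price per GPU, USD


@dataclass(frozen=True)
class Strategy:
    gpu: GpuType
    num_gpus: int
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree
    dp: int  # data-parallel degree


def factorizations(n):
    """Yield every (tp, pp, dp) with tp * pp * dp == n."""
    for tp in range(1, n + 1):
        if n % tp:
            continue
        rest = n // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                yield tp, pp, rest // pp


def fits(s, model_mem_gb=120.0):
    """Toy memory check: model state is sharded across tp * pp GPUs."""
    return model_mem_gb / (s.tp * s.pp) <= s.gpu.mem_gb


def step_time(s, work_tflop=1000.0, micro_batches=8):
    """Toy time model (an assumption, not Astra's): ideal compute time
    inflated by pipeline-bubble and communication penalties."""
    compute = work_tflop / (s.gpu.tflops * s.num_gpus)
    overhead = (s.pp - 1) / micro_batches + 0.05 * (s.tp - 1) + 0.02 * (s.dp - 1)
    return compute * (1.0 + overhead)


def cost_usd(s, steps=1000):
    """Rental cost of running `steps` training steps under strategy s."""
    hours = step_time(s) * steps / 3600.0
    return hours * s.gpu.price_per_hour * s.num_gpus


def search(gpu_types, gpu_counts, objective):
    """Exhaustively score every feasible candidate; return the minimizer."""
    candidates = (
        Strategy(g, n, tp, pp, dp)
        for g, n in product(gpu_types, gpu_counts)
        for tp, pp, dp in factorizations(n)
    )
    return min((s for s in candidates if fits(s)), key=objective)


# Hypothetical GPU catalogue; the numbers are made up for the example.
GPUS = [GpuType("fast", 300.0, 80.0, 4.0), GpuType("slow", 100.0, 40.0, 1.0)]
fastest = search(GPUS, [4, 8], step_time)
cheapest = search(GPUS, [4, 8], cost_usd)
print("fastest:", fastest)
print("cheapest:", cheapest)
```

Passing `step_time` as the objective gives an efficiency-oriented search, while passing `cost_usd` gives a money-oriented one over the same candidate set. The sketch enumerates exhaustively, which only works because this toy space is tiny.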

