DCMay 12

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

arXiv:2409.01143 87.715 citations h-index: 19
AI Analysis

For practitioners with access to heterogeneous GPU clusters, HexiScale offers a practical way to use mixed hardware for LLM training. In the reported experiments it performs on par with homogeneous systems at the same theoretical FLOPS.

HexiScale enables efficient LLM training over heterogeneous GPUs by supporting asymmetric partition of computations across data, pipeline, and tensor parallelism, achieving 1.5× to 2.4× higher throughput than state-of-the-art heterogeneous baselines while matching homogeneous performance on equivalent FLOPS.

Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves similar performance when running over heterogeneous GPUs with the same theoretical FLOPS; (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\times$ to $2.4\times$ higher throughput.
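To make "asymmetric partition" concrete, the sketch below covers only the pipeline dimension: it assigns more transformer layers to faster pipeline stages so that no slow stage becomes the bottleneck. This is not HexiScale's hierarchical graph partitioning algorithm, which the abstract describes as jointly handling data, pipeline, and tensor parallelism. It is a simple greedy heuristic, and the function names, the per-stage throughput figures, and the assumption that every layer costs the same are all hypothetical choices made for illustration.

```python
from typing import List


def split_layers(num_layers: int, stage_tflops: List[float]) -> List[int]:
    """Greedily assign layer counts to pipeline stages of unequal speed.

    Assumes every layer has identical cost and that a stage's time per
    micro-batch is proportional to (layers on stage) / (stage throughput).
    Communication and memory limits are ignored in this toy model.
    """
    n = len(stage_tflops)
    if num_layers < n:
        raise ValueError("need at least one layer per pipeline stage")

    # Every stage must hold at least one layer.
    counts = [1] * n

    # Hand out the remaining layers one at a time, always to the stage
    # that would finish soonest after receiving one more layer.
    for _ in range(num_layers - n):
        best = min(range(n), key=lambda i: (counts[i] + 1) / stage_tflops[i])
        counts[best] += 1
    return counts


def bottleneck_time(counts: List[int], stage_tflops: List[float]) -> float:
    """Relative time of the slowest stage, which bounds pipeline throughput."""
    return max(c / t for c, t in zip(counts, stage_tflops))


if __name__ == "__main__":
    # Hypothetical 4-stage pipeline: two faster and two slower GPUs.
    tflops = [312.0, 312.0, 125.0, 125.0]
    asymmetric = split_layers(32, tflops)
    uniform = [8, 8, 8, 8]
    print("asymmetric split:", asymmetric)
    print("bottleneck (asymmetric):", bottleneck_time(asymmetric, tflops))
    print("bottleneck (uniform):   ", bottleneck_time(uniform, tflops))
```

In this toy model, a uniform split leaves the slower stages as the bottleneck, while the asymmetric split shifts layers toward the faster stages. A real system would also need to account for memory capacity, interconnect bandwidth, and the data- and tensor-parallel dimensions, which is what motivates the constrained optimization formulation in the abstract.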

Code Implementations: 1 repo
