DC · AR · LG · Oct 4, 2023

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

arXiv:2310.02784v3 · 20 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the high operational costs and inefficiencies of training and deploying large-scale ML models on distributed systems. It is a strong domain-specific optimization.

The paper tackles the problem of exposed (non-overlapped) communication consuming 14-32% of GPU hours in distributed training of large ML models. It introduces MAD-Max, a performance modeling framework for optimizing parallelization strategies, and uses it to show potential throughput improvements of up to 2.24x for pre-training and up to 5.2x for inference.

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14~32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24x for pre-training and up to 5.2x for inference scenarios, respectively.

