DC, AI, LG | Apr 30, 2025

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

arXiv:2504.21411v1 | 1 citations | h-index: 17 | Has Code
Originality: Incremental advance
AI Analysis

This system addresses the challenge of complex distributed training for researchers and practitioners, offering an incremental improvement in automation and efficiency over prior methods.

The paper addresses efficient training of large-scale Foundation Models. It introduces Galvatron, an automatic distributed system that searches for an efficient hybrid parallelism strategy, and its benchmarks report higher throughput than existing frameworks.

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy, incorporating data, tensor, pipeline, sharded data, and sequence parallelism, along with recomputation. The system's architecture includes a profiler for hardware and model analysis, a search engine for strategy optimization using decision trees and dynamic programming, and a runtime for executing these strategies efficiently. Benchmarking on various clusters demonstrates Galvatron's superior throughput compared to existing frameworks. This open-source system offers user-friendly interfaces and comprehensive documentation, making complex distributed training accessible and efficient. The source code of Galvatron is available at https://github.com/PKU-DAIR/Hetu-Galvatron.
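To make the search-engine idea concrete, here is a minimal sketch of a layer-wise strategy search using dynamic programming under a memory budget. It is an illustration only, not Galvatron's actual API or cost model: the function names `factorizations` and `search`, the strategy labels, and all time and memory numbers are hypothetical assumptions. In a real system the per-layer costs would come from a profiler rather than a hand-written table.

```python
def factorizations(n_gpus):
    """Enumerate all (dp, tp, pp) degree triples whose product is n_gpus.

    Hypothetical helper: shows how the candidate space of hybrid
    parallelism degrees can be listed before any cost estimation.
    """
    out = []
    for dp in range(1, n_gpus + 1):
        if n_gpus % dp:
            continue
        rest = n_gpus // dp
        for tp in range(1, rest + 1):
            if rest % tp == 0:
                out.append((dp, tp, rest // tp))
    return out


def search(layer_options, mem_budget):
    """Pick one strategy per layer, minimizing total time within mem_budget.

    layer_options[i] is a list of (name, time, mem) tuples for layer i,
    where mem is an integer number of memory units (assumed cost model).
    Returns (total_time, [name per layer]), or None if nothing fits.
    """
    # best maps "memory used so far" -> (lowest time, choices so far)
    best = {0: (0.0, [])}
    for options in layer_options:
        nxt = {}
        for used, (time_so_far, choices) in best.items():
            for name, time, mem in options:
                total_mem = used + mem
                if total_mem > mem_budget:
                    continue  # this combination exceeds the budget
                cand = time_so_far + time
                if total_mem not in nxt or cand < nxt[total_mem][0]:
                    nxt[total_mem] = (cand, choices + [name])
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical per-layer costs: (strategy, time per step, memory units).
OPTIONS = [("data", 1.0, 4), ("tensor", 1.5, 2), ("data+recompute", 1.3, 3)]

if __name__ == "__main__":
    print(factorizations(8))
    print(search([OPTIONS, OPTIONS], mem_budget=6))
```

With a budget of 6 units across two layers, pure data parallelism on both layers does not fit, so the search mixes strategies; raising the budget to 8 lets it choose the fastest option everywhere. The trade-off between time and memory per layer is the part this toy shares with the description above; everything else is simplified.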

Code Implementations: 1 repo