DC LG Sep 5, 2024

Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling

Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui

arXiv:2409.03365v38.69 citations h-index: 24

Originality: Incremental advance

AI Analysis

Spindle addresses system bottlenecks for researchers and practitioners training complex multi-task, multi-modal AI models, though it is an incremental improvement in optimization methods.

The paper tackles the challenge of efficiently training multi-task, multi-modal foundation models by proposing Spindle, a new training system using wavefront scheduling, which achieves up to 71% speedup compared to state-of-the-art systems.

Recent foundation models are capable of handling multiple tasks and multiple data modalities with a unified base model structure and several specialized model components. However, efficient training of such multi-task (MT) multi-modal (MM) models poses significant system challenges due to the sophisticated model architecture and the heterogeneous workloads of different tasks and modalities. In this paper, we propose Spindle, a new training system tailored for resource-efficient and high-performance training of MT MM models via wavefront scheduling. The key idea of Spindle is to decompose the model execution into waves and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. We build our system and evaluate it on various MT MM models. Experiments demonstrate the superior performance and efficiency of Spindle, with speedups of up to 71% compared to state-of-the-art training systems.
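
To make the idea of decomposing execution into waves more concrete, here is a minimal sketch of dependency-driven wave grouping. It is not Spindle's actual algorithm, which the abstract does not spell out; it only shows a generic level-by-level topological grouping in which every operator in a wave depends solely on operators from earlier waves. The function name `build_waves`, the operator names, and the example graph are all hypothetical.

```python
from collections import defaultdict


def build_waves(deps):
    """Group operators into waves (hypothetical helper, not Spindle's API).

    deps maps each operator name to the list of operators it depends on.
    Every operator in a wave has all of its dependencies in earlier waves,
    so operators within one wave could in principle run concurrently.
    """
    for op, parents in deps.items():
        for parent in parents:
            if parent not in deps:
                raise KeyError(f"{op} depends on unknown operator {parent}")

    # Count unmet dependencies and record reverse edges.
    indegree = {op: len(parents) for op, parents in deps.items()}
    children = defaultdict(list)
    for op, parents in deps.items():
        for parent in parents:
            children[parent].append(op)

    # Level-wise variant of Kahn's topological sort.
    ready = sorted(op for op, degree in indegree.items() if degree == 0)
    waves = []
    scheduled = 0
    while ready:
        waves.append(ready)
        scheduled += len(ready)
        next_ready = []
        for op in ready:
            for child in children[op]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    if scheduled != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return waves


# Assumed toy MT MM model: three modality encoders feed a shared backbone,
# which feeds two task-specific heads.
example_deps = {
    "text_encoder": [],
    "image_encoder": [],
    "audio_encoder": [],
    "shared_backbone": ["text_encoder", "image_encoder", "audio_encoder"],
    "caption_head": ["shared_backbone"],
    "vqa_head": ["shared_backbone"],
}
example_waves = build_waves(example_deps)
```

In this toy graph the three encoders form the first wave, the shared backbone the second, and the two task heads the third. A real system would additionally have to decide how many devices each operator in a wave receives, which is the heterogeneity-aware parallelization half of the joint problem and is not modeled here.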

