LG · AI · Jun 28, 2024

Enhancing Stability for Large Language Models Training in Constrained Bandwidth Networks

arXiv:2407.01614v3
Originality: Incremental advance
AI Analysis

This paper addresses convergence failures that researchers and engineers encounter when training massive LLMs on bandwidth-constrained networks. It is an incremental advance because it builds on the existing ZeRO++ method.

The paper tackles instability when training large language models (LLMs) with billions of parameters on low-bandwidth networks. It traces the instability to race conditions in ZeRO++ hierarchical partitioning (hpZ) and proposes a modified partitioning algorithm that converges reliably, with a reported 98% improvement in throughput and model training speed.

Training extremely large language models (LLMs) with billions of parameters is a computationally intensive task that pushes the limits of current data-parallel training systems. While techniques like ZeRO++ have enabled efficient distributed training of such giant models on inexpensive low-bandwidth clusters, they can suffer from convergence issues due to potential race conditions in the hierarchical partitioning (hpZ) scheme employed to reduce cross-machine communication. In this work, we first show how these race conditions cause instability when training models with billions of parameters. We then propose a modification to the partitioning algorithm that addresses these convergence challenges while maintaining competitive training efficiency. Empirical evaluation on training the multi-billion-parameter Falcon and Llama-2 models demonstrates the updated algorithm's ability to achieve reliable convergence on these massive models, where stock ZeRO++ hpZ fails to converge. The updated algorithm enables robust training of larger models with 98% throughput and model training speed improvement without sacrificing the quality of convergence.
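The abstract does not say where in hpZ the race arises or how the modified algorithm removes it. As a purely illustrative toy, and not the paper's algorithm or any DeepSpeed API, the sketch below shows the general hazard class: a consumer reads a secondary copy of some weights before an asynchronous copy into it has completed, and so sees stale values. All names here (`async_copy`, `gather_unsynchronized`, `gather_synchronized`, the `gate` event) are hypothetical, and plain Python threads stand in for asynchronous device copies.

```python
import threading


def async_copy(src, dst, gate):
    """Start a background copy of src into dst.

    The copy only runs once `gate` is set, which models work that has been
    queued but not yet executed. Hypothetical helper for illustration only.
    """
    def worker():
        gate.wait()
        dst[:] = src

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def gather_unsynchronized(primary):
    """Read the secondary buffer without waiting for the pending copy."""
    secondary = [0.0] * len(primary)   # stale contents from an earlier step
    gate = threading.Event()
    handle = async_copy(primary, secondary, gate)
    snapshot = list(secondary)         # read happens before the copy has run
    gate.set()                         # the copy only executes afterwards
    handle.join()
    return snapshot


def gather_synchronized(primary):
    """Wait for the pending copy to finish before reading the buffer."""
    secondary = [0.0] * len(primary)
    gate = threading.Event()
    handle = async_copy(primary, secondary, gate)
    gate.set()
    handle.join()                      # explicit synchronization point
    return list(secondary)


weights = [0.5, -1.25, 2.0]
stale = gather_unsynchronized(weights)
fresh = gather_synchronized(weights)
print("unsynchronized read:", stale)
print("synchronized read:  ", fresh)
```

In this toy the stale read is forced deterministically by the `gate` event; in a real system the ordering would depend on timing, which is what makes such bugs intermittent and hard to reproduce.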

