Rethinking Memory and Communication Cost for Efficient Large Language Model Training
This paper addresses efficiency bottlenecks in large-scale LLM training and offers incremental improvements to existing distributed training strategies.
The paper tackles the trade-off between memory consumption and communication cost in distributed training of large language models. It proposes a memory-communication balanced strategy set (PaRO) that improves training throughput by 1.19x-2.50x over the SOTA method, and a communication topology (HO-Ring) that improves communication efficiency by 36.5% over the traditional Ring algorithm.
Recently, various distributed strategies for large language model training have been proposed. However, these methods provide limited solutions for the trade-off between memory consumption and communication cost. In this paper, we rethink the impact of memory consumption and communication cost on the training speed of large language models, and propose a memory-communication balanced set of strategies, the Partial Redundancy Optimizer (PaRO). PaRO provides comprehensive options that reduce the amount and frequency of inter-group communication, at the cost of minor memory redundancy, through a fine-grained sharding strategy, thereby improving training efficiency in various training scenarios. Additionally, we propose a Hierarchical Overlapping Ring (HO-Ring) communication topology to enhance communication efficiency between nodes or across switches in large language model training. Our experiments demonstrate that PaRO significantly improves training throughput by 1.19x-2.50x compared to the SOTA method and achieves near-linear scalability. The HO-Ring algorithm improves communication efficiency by 36.5% compared to the traditional Ring algorithm.
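
To make the memory side of this trade-off concrete, the following is a minimal sketch of a per-GPU memory estimate when each model state (parameters, gradients, optimizer states) is sharded at a different granularity. It is an illustration under stated assumptions, not the paper's implementation: the three level names (`none`, `intra_group`, `global`), the function names, and the 2/2/12 bytes-per-parameter accounting for mixed-precision Adam are assumptions introduced here, and activations, buffers, and fragmentation are ignored.

```python
# Illustrative memory model for partially redundant sharding.
# All names and the byte accounting below are assumptions for this sketch.

# Bytes per parameter under mixed-precision Adam: fp16 parameters,
# fp16 gradients, and fp32 master weights plus two fp32 Adam moments.
BYTES_PER_PARAM = {"params": 2, "grads": 2, "optimizer": 12}


def shard_degree(level: str, world_size: int, group_size: int) -> int:
    """Return how many GPUs one model state is split across."""
    if level == "none":          # fully replicated on every GPU
        return 1
    if level == "intra_group":   # sharded inside a group, replicated across groups
        return group_size
    if level == "global":        # sharded across every GPU in the job
        return world_size
    raise ValueError(f"unknown sharding level: {level}")


def model_state_gb_per_gpu(num_params: float, levels: dict,
                           world_size: int, group_size: int) -> float:
    """Estimate model-state memory per GPU in GB (10^9 bytes)."""
    total_bytes = 0.0
    for state, bytes_per_param in BYTES_PER_PARAM.items():
        degree = shard_degree(levels[state], world_size, group_size)
        total_bytes += num_params * bytes_per_param / degree
    return total_bytes / 1e9


if __name__ == "__main__":
    configs = {
        "all global": {"params": "global", "grads": "global", "optimizer": "global"},
        "partial redundancy": {"params": "intra_group", "grads": "intra_group",
                               "optimizer": "global"},
        "no sharding": {"params": "none", "grads": "none", "optimizer": "none"},
    }
    for name, levels in configs.items():
        gb = model_state_gb_per_gpu(7e9, levels, world_size=64, group_size=8)
        print(f"{name}: {gb:.2f} GB per GPU")
```

For a hypothetical 7-billion-parameter model on 64 GPUs in groups of 8, this estimate gives 1.75 GB per GPU when every state is sharded globally, about 4.81 GB when parameters and gradients are sharded only within a group, and 112 GB with no sharding. The middle configuration spends a few extra gigabytes on redundancy so that parameters and gradients can be gathered without crossing group boundaries, which is the kind of exchange the abstract describes.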