LG AI DC | May 2, 2024

AB-Training: A Communication-Efficient Approach for Distributed Low-Rank Learning

Daniel Coquelin, Katherina Flügel, Marie Weiel, Nicholas Kiefer, Muhammed Öz, Charlotte Debus, Achim Streit, Markus Götz

arXiv:2405.01067v2 | 2.6 | h-index: 7

Originality: Highly original

AI Analysis

This paper addresses scalability issues in distributed training for HPC environments, offering incremental improvements over existing methods.

The paper tackles communication bottlenecks in distributed neural network training by introducing AB-training, a data-parallel method that uses low-rank representations and independent training groups. It reports an average 70.31% reduction in network traffic across scaling scenarios and a 44.14:1 compression ratio on VGG16 trained on CIFAR-10 with minimal accuracy loss.

Communication bottlenecks severely hinder the scalability of distributed neural network training, particularly in high-performance computing (HPC) environments. We introduce AB-training, a novel data-parallel method that leverages low-rank representations and independent training groups to significantly reduce communication overhead. Our experiments demonstrate an average reduction in network traffic of approximately 70.31% across various scaling scenarios, increasing the training potential of communication-constrained systems and accelerating convergence at scale. AB-training also exhibits a pronounced regularization effect at smaller scales, leading to improved generalization while maintaining or even reducing training time. We achieve a remarkable 44.14:1 compression ratio on VGG16 trained on CIFAR-10 with minimal accuracy loss, and outperform traditional data parallel training by 1.55% on ResNet-50 trained on ImageNet-2012. While AB-training is promising, our findings also reveal that large batch effects persist even in low-rank regimes, underscoring the need for further research into optimized update mechanisms for massively distributed training.
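To make the reported figures concrete, the following is a minimal sketch of the arithmetic behind a generic low-rank factorization, assuming a dense m x n weight matrix is replaced by two factors of shapes m x r and r x n. The function names, the layer shape, and the rank are hypothetical and chosen only for illustration; this is not the authors' implementation, and the abstract does not state how its compression ratio or traffic reduction were measured.

```python
# Illustrative arithmetic only; all names and shapes here are assumptions.

def dense_params(m: int, n: int) -> int:
    """Number of entries in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Entries in two factors of shapes (m x r) and (r x n)."""
    return r * (m + n)


def compression_ratio(m: int, n: int, r: int) -> float:
    """Dense parameter count divided by the factored parameter count."""
    return dense_params(m, n) / low_rank_params(m, n, r)


def remaining_traffic_fraction(reduction_percent: float) -> float:
    """Fraction of baseline traffic left after a given percent reduction."""
    return 1.0 - reduction_percent / 100.0


if __name__ == "__main__":
    # Hypothetical 512 x 512 layer factored at rank 32.
    print(compression_ratio(512, 512, 32))
    # The abstract's average 70.31% reduction, as a fraction of baseline.
    print(remaining_traffic_fraction(70.31))
```

Under these assumptions, factoring pays off only when r is well below mn / (m + n). For scale, a 44.14:1 ratio corresponds to keeping roughly 2.3% of the original parameter count, and a 70.31% traffic reduction leaves roughly 29.7% of the baseline traffic.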
