LG, DC | Feb 18, 2021

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

Bradley T. Baker, Aashis Khanal, Vince D. Calhoun, Barak Pearlmutter, Sergey M. Plis

arXiv:2102.09631v35.52 citations

Originality: Highly original

AI Analysis

This paper addresses communication bottlenecks in distributed machine learning, offering researchers and practitioners a novel alternative to gradient-centric algorithms.

The paper tackles communication overhead in distributed deep learning by introducing distributed auto-differentiation (dAD), which exploits the outer-product structure of gradients to reduce what must be shared during training. The authors report that dAD trains more efficiently than other state-of-the-art distributed methods on modern architectures such as transformers, using large-scale text and imaging datasets.

Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, quantization, and more, to curtail bandwidth. We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. The exposed structure of the gradient evokes a new class of distributed learning algorithm, which is naturally more communication-efficient than full gradient sharing. Our approach, called distributed auto-differentiation (dAD), builds on a marriage of rank-based compression and the innate structure of the gradient as an outer product. We demonstrate that dAD trains more efficiently than other state-of-the-art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets. The future of distributed learning, we determine, need not be dominated by gradient-centric algorithms.
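The outer-product structure mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation of dAD and it omits the rank-based compression step. It only shows, for a single linear layer with a squared-error loss (both chosen here as assumptions), that the weight gradient factors into input activations and output deltas, and that sharing those two factors across sites reproduces the summed gradient while sending fewer numbers when the batch is small relative to the layer width. The function name `local_factors`, the layer sizes, and the two-site setup are hypothetical.

```python
import numpy as np

def local_factors(x, w, y):
    """Return the two gradient factors for one linear layer at one site.

    Assumes the loss L = 0.5 * sum((x @ w - y) ** 2), so dL/dw = x.T @ delta.
    """
    out = x @ w          # forward pass, shape (n, d_out)
    delta = out - y      # dL/d(out), shape (n, d_out)
    return x, delta

rng = np.random.default_rng(0)
d_in, d_out, n = 512, 256, 8   # hypothetical layer width and per-site batch size
w = rng.standard_normal((d_in, d_out))

# Two sites, each holding its own private batch (inputs and targets).
sites = [
    (rng.standard_normal((n, d_in)), rng.standard_normal((n, d_out)))
    for _ in range(2)
]
factors = [local_factors(x, w, y) for x, y in sites]

# Gradient-centric sharing: every site sends a full (d_in, d_out) gradient.
grad_sum = sum(a.T @ d for a, d in factors)

# Factor sharing: sites send activations and deltas; the gradient is
# rebuilt from the concatenated factors with a single matrix product.
acts = np.concatenate([a for a, _ in factors])
deltas = np.concatenate([d for _, d in factors])
grad_from_factors = acts.T @ deltas

# Numbers each site must transmit under the two schemes.
floats_full_gradient = d_in * d_out
floats_factors = n * (d_in + d_out)

print(np.allclose(grad_sum, grad_from_factors))
print(floats_full_gradient, floats_factors)
```

The saving holds only while `n * (d_in + d_out)` stays below `d_in * d_out`; with large batches the factors can cost more to send than the gradient itself, which is presumably where the rank-based compression mentioned in the abstract comes in.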
