LG · Jun 18, 2023

DropCompute: simple and more robust distributed synchronous training via compute variance reduction

arXiv:2306.10598v2 · 4 citations · h-index: 44
Originality: Incremental advance
AI Analysis

The paper addresses scalability issues in large-scale deep neural network training, but the advance is incremental because it builds on existing All-Reduce methods.

The paper tackles the problem of straggling workers limiting scalability in distributed synchronous training. It analyzes compute-time variability and proposes a decentralized method to reduce that variation among workers, validated on large-scale tasks with 200 Gaudi Accelerators.

Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each step. Thus, these methods are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers are straggling due to variability in compute time. We find an analytical relation between compute time properties and scalability limitations, caused by such straggling workers. With these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi Accelerators.
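The straggler effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm or its analytical result: it assumes per-worker compute times are independent Gaussian draws, and the `cap` parameter is a hypothetical stand-in for any mechanism that bounds how long a worker computes before synchronizing. The function name and all parameters are illustrative.

```python
import random
import statistics


def simulate_step_time(num_workers, num_steps=2000, mean=1.0, std=0.1,
                       cap=None, seed=0):
    """Average time per synchronous step under a toy compute-time model.

    Assumption: each worker's compute time is an independent Gaussian draw
    (clipped at zero). A synchronous step ends only when the slowest worker
    finishes, so the step time is the maximum over workers.
    `cap` is a hypothetical per-worker compute budget; the cost of any work
    dropped at the cap is ignored in this sketch.
    """
    rng = random.Random(seed)
    step_times = []
    for _ in range(num_steps):
        times = [max(0.0, rng.gauss(mean, std)) for _ in range(num_workers)]
        if cap is not None:
            # Bound each worker's compute time by the budget.
            times = [min(t, cap) for t in times]
        # All workers wait for the slowest one before the All-Reduce.
        step_times.append(max(times))
    return statistics.fmean(step_times)


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n, round(simulate_step_time(n), 3),
              round(simulate_step_time(n, cap=1.1), 3))
```

Under these assumptions the uncapped step time grows with the number of workers even though each worker's mean compute time is unchanged, while the capped variant stays bounded by the budget. A real system would also have to account for the work lost when compute is cut short, which this sketch leaves out.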

Code Implementations: 1 repo