DC · AI · CV · Dec 5, 2023

Flexible Communication for Optimal Distributed Learning over Unpredictable Networks

arXiv:2312.02493v2 · 3 citations · h-index: 7 · BigData
AI Analysis

This work addresses communication bottlenecks in distributed training for machine learning practitioners, offering an incremental optimization to enhance efficiency without sacrificing model performance.

The paper tackles the trade-off between communication cost and model accuracy in distributed deep learning by proposing a flexible communication strategy that dynamically switches between Allgather and Allreduce collectives and adjusts compression ratios, achieving high accuracy with improved training speed under varying network conditions.
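To make the switching idea concrete, here is a minimal sketch that picks the cheaper collective from a textbook latency-bandwidth (alpha-beta) cost model for ring algorithms. It is not the paper's cost model or implementation. The names `Network`, `ring_allgather_cost`, `ring_allreduce_cost`, and `pick_collective` are hypothetical, `cr` is taken to be the fraction of gradient values kept, and the Allreduce path assumes all workers already agree on the same indices, so the cost of reaching that agreement is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    latency_s: float          # alpha: per-message latency in seconds
    bandwidth_bytes_s: float  # 1/beta: effective bandwidth in bytes per second


def ring_allgather_cost(n_workers: int, bytes_per_worker: float, net: Network) -> float:
    """Ring Allgather: (p - 1) steps, each forwarding one worker's full contribution."""
    steps = n_workers - 1
    return steps * (net.latency_s + bytes_per_worker / net.bandwidth_bytes_s)


def ring_allreduce_cost(n_workers: int, payload_bytes: float, net: Network) -> float:
    """Ring Allreduce: 2(p - 1) steps, each moving a 1/p chunk of the payload."""
    steps = 2 * (n_workers - 1)
    chunk = payload_bytes / n_workers
    return steps * (net.latency_s + chunk / net.bandwidth_bytes_s)


def pick_collective(n_workers: int, n_params: int, cr: float, net: Network,
                    bytes_per_elem: int = 4) -> tuple[str, float]:
    """Return the cheaper collective and its estimated time in seconds."""
    k = max(1, int(cr * n_params))  # number of gradient values kept
    # Allgather ships k values plus k indices from every worker.
    ag = ring_allgather_cost(n_workers, 2 * k * bytes_per_elem, net)
    # Allreduce ships k values only (assumption: indices are shared already).
    ar = ring_allreduce_cost(n_workers, k * bytes_per_elem, net)
    return ("AR", ar) if ar < ag else ("AG", ag)


if __name__ == "__main__":
    fast_link = Network(latency_s=1e-4, bandwidth_bytes_s=1e9)
    slow_start = Network(latency_s=1e-2, bandwidth_bytes_s=1e9)
    print(pick_collective(8, 25_000_000, 0.01, fast_link))  # large payload
    print(pick_collective(8, 1_000, 0.01, slow_start))      # tiny payload, high latency
```

Under this model, Allreduce tends to win when the payload is large enough for bandwidth to dominate, while Allgather can win when latency dominates, because the ring Allreduce takes twice as many steps. Re-evaluating the choice as measured latency and bandwidth change is one simple way to approximate a flexible strategy.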

Gradient compression alleviates expensive communication in distributed deep learning by sending fewer values and their corresponding indices, typically via Allgather (AG). Training with a high compression ratio (CR) achieves high accuracy like DenseSGD, but has lower parallel scaling (i.e., parallel efficiency) due to high communication cost. Using lower CRs improves parallel efficiency by lowering synchronization cost, but also degrades model accuracy (i.e., statistical efficiency). Further, the speedup attained with different models and CRs also varies with network latency, effective bandwidth, and the collective op used for aggregation. In many cases, collectives like Allreduce (AR) have lower cost than AG to exchange the same amount of data. In this paper, we propose an AR-compatible Topk compressor that is bandwidth-optimal and thus performs better than AG in certain network configurations. We develop a flexible communication strategy that switches between AG and AR based on which collective is optimal in the current settings, and model the Pareto relationship between parallel and statistical efficiency as a multi-objective optimization (MOO) problem to dynamically adjust CR and accelerate training while still converging to high accuracy.
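As an illustration of the compression step the abstract describes, the following is a minimal pure-Python sketch of Top-k sparsification with an Allgather-style aggregation simulated in a single process. It is not the authors' code: `topk_compress`, `decompress`, and `allgather_aggregate` are hypothetical names, and `cr` is assumed to be the fraction of values kept, which matches the abstract's statement that a higher CR means more communication.

```python
import heapq


def topk_compress(grad: list[float], cr: float) -> tuple[list[float], list[int]]:
    """Keep the k largest-magnitude entries; return (values, indices)."""
    k = max(1, int(cr * len(grad)))
    indices = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    indices.sort()  # sorted indices make the output deterministic and easy to read
    return [grad[i] for i in indices], indices


def decompress(values: list[float], indices: list[int], size: int) -> list[float]:
    """Scatter the kept values back into a dense vector of zeros."""
    dense = [0.0] * size
    for value, index in zip(values, indices):
        dense[index] = value
    return dense


def allgather_aggregate(worker_grads: list[list[float]], cr: float) -> list[float]:
    """Simulate Allgather: every worker contributes its own (values, indices)
    pairs, which are then summed and averaged into one dense gradient."""
    size = len(worker_grads[0])
    total = [0.0] * size
    for grad in worker_grads:
        values, indices = topk_compress(grad, cr)
        for value, index in zip(values, indices):
            total[index] += value
    return [t / len(worker_grads) for t in total]


if __name__ == "__main__":
    g1 = [0.1, -2.0, 0.3, 1.5]
    g2 = [3.0, 0.2, -0.1, 1.0]
    print(topk_compress(g1, 0.5))
    print(allgather_aggregate([g1, g2], 0.5))
```

Because each worker may select different indices, the values and indices must travel together, which is why this scheme is usually paired with Allgather. An Allreduce-compatible variant would need all workers to reduce over a common index set.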

Code Implementations: 1 repo
