DC · LG · Oct 11, 2019

Blink: Fast and Generic Collectives for Distributed ML

arXiv:1910.04940v1 · 167 citations
Originality: Highly original
AI Analysis

This paper addresses a critical bottleneck in scaling data-parallel training, namely parameter synchronization across GPUs, and reports up to 8x faster synchronization than NCCL.

The paper tackles the high overhead of synchronizing model parameters across GPUs in distributed machine learning. It proposes Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. Compared to NCCL, Blink achieves up to 8x faster synchronization and reduces end-to-end training time for image classification tasks by up to 40%.

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.
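The idea of packing spanning trees over a GPU interconnect can be illustrated with a small sketch. This is not Blink's algorithm: the abstract does not spell out the packing procedure, so the code below swaps in a simple greedy depth-first heuristic that carries no optimality guarantee. The function names (`find_spanning_tree`, `pack_trees`, `split_buffer`), the topology, and the unit link capacities are all hypothetical. The sketch repeatedly extracts a spanning tree rooted at the broadcasting GPU from the remaining link capacity, then divides a buffer across the trees in proportion to their weights.

```python
def find_spanning_tree(nodes, residual, root):
    """Depth-first search from root over directed links with spare capacity.

    Returns a list of (parent, child) edges reaching every node,
    or None if some node cannot be reached.
    """
    parent = {root: None}
    stack = [root]
    while stack:
        u = stack[-1]
        for v in nodes:
            if v not in parent and residual.get((u, v), 0) > 0:
                parent[v] = u
                stack.append(v)
                break
        else:
            stack.pop()  # no unvisited neighbour left, backtrack
    if len(parent) != len(nodes):
        return None
    return [(p, c) for c, p in parent.items() if p is not None]


def pack_trees(nodes, capacity, root):
    """Greedy heuristic (an assumption, not Blink's method): peel off trees
    until the root can no longer reach every node."""
    if len(nodes) < 2:
        return []
    residual = dict(capacity)
    trees = []
    while True:
        tree = find_spanning_tree(nodes, residual, root)
        if tree is None:
            break
        weight = min(residual[e] for e in tree)  # bottleneck link of this tree
        for e in tree:
            residual[e] -= weight
        trees.append((tree, weight))
    return trees


def split_buffer(num_bytes, trees):
    """Divide a buffer across trees in proportion to their weights."""
    if not trees:
        raise ValueError("no spanning tree available")
    total = sum(w for _, w in trees)
    shares = [num_bytes * w // total for _, w in trees]
    shares[-1] += num_bytes - sum(shares)  # give rounding remainder to last tree
    return shares


# Hypothetical topology: 3 fully connected GPUs, unit capacity per direction.
nodes = [0, 1, 2]
capacity = {(u, v): 1 for u in nodes for v in nodes if u != v}
trees = pack_trees(nodes, capacity, root=0)
shares = split_buffer(101, trees)
```

On this toy topology the heuristic finds two chain-shaped trees (0→1→2 and 0→2→1), so both outgoing links of the root carry data at once. A real library would also have to handle heterogeneous link speeds and limit the number of trees, which the abstract names as goals of Blink.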

