DC, LG, PL | May 12, 2021

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

arXiv:2105.05720v5 | 100 citations
Originality: Highly original
AI Analysis

This paper addresses inefficient distributed training and inference for large models. Its optimization approach is novel, though incremental in that it automates kernel modifications that were previously made by hand.

The paper tackles the performance limitations in distributed machine learning due to the separation of computation and communication kernels by introducing CoCoNeT, a system that unifies them with a DSL and compiler, resulting in significant performance improvements over state-of-the-art implementations.

The recent trend toward increasingly large machine learning models requires both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain the best performance. However, the current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities that span this barrier. Breaking the abstraction and considering both sides holistically enables many optimizations that improve the performance of distributed workloads. Applying these optimizations manually requires modifying the underlying computation and communication libraries for each scenario, which is time-consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program containing both computation and communication. CoCoNeT contains several machine-learning-aware transformations to optimize a program and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs allows users to work at a high level of abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNeT enables us to optimize data-, model-, and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.
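To make the idea of optimizing across the computation/communication barrier concrete, here is a minimal pure-Python simulation of one such optimization: replacing an AllReduce followed by an elementwise update with a ReduceScatter, an update on each rank's slice, and an AllGather. This is only a sketch under stated assumptions. It is not CoCoNeT's DSL or API; the function names (`all_reduce`, `reduce_scatter`, `all_gather`, `separated`, `split_and_reordered`, `counting`) are hypothetical, and "ranks" are simulated as plain Python lists rather than real devices.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Simulated AllReduce: every rank receives the full elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Simulated ReduceScatter: rank r receives only slice r of the sum."""
    n = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // n  # assumption: length is divisible by the rank count
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]


def all_gather(slices: List[Vector]) -> List[Vector]:
    """Simulated AllGather: every rank receives the concatenation of all slices."""
    full = [x for s in slices for x in s]
    return [list(full) for _ in slices]


def separated(grads: List[Vector], update: Callable[[float], float]) -> List[Vector]:
    """Baseline: communicate first, then every rank updates the whole vector."""
    reduced = all_reduce(grads)
    return [[update(g) for g in vec] for vec in reduced]


def split_and_reordered(grads: List[Vector], update: Callable[[float], float]) -> List[Vector]:
    """Cross-barrier variant: each rank updates only its own slice of the sum."""
    slices = reduce_scatter(grads)
    updated = [[update(g) for g in s] for s in slices]
    return all_gather(updated)


def counting(fn: Callable[[float], float]) -> Tuple[Callable[[float], float], Dict[str, int]]:
    """Wrap fn so the number of elementwise calls can be inspected."""
    calls = {"n": 0}

    def wrapped(x: float) -> float:
        calls["n"] += 1
        return fn(x)

    return wrapped, calls


if __name__ == "__main__":
    # 4 simulated ranks, each holding an 8-element gradient vector.
    grads = [[float(r + i) for i in range(8)] for r in range(4)]
    base_fn, base_calls = counting(lambda g: -0.5 * g)
    opt_fn, opt_calls = counting(lambda g: -0.5 * g)
    assert separated(grads, base_fn) == split_and_reordered(grads, opt_fn)
    print(base_calls["n"], opt_calls["n"])  # prints "32 8"
```

Both variants produce the same result on every rank, but the second performs the elementwise update once per element instead of once per element per rank. Seeing this rewrite requires knowing that the update follows the AllReduce, which is exactly the information that is lost when the communication and computation kernels are treated as separate black boxes.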

Code Implementations: 2 repos