PL DC LG

Oct 20, 2021

Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis

arXiv:2110.10548v23.319 citations

Originality: Incremental advance

AI Analysis

This work addresses performance bottlenecks in distributed deep learning training for researchers and practitioners, offering incremental improvements through automated synthesis.

The paper tackles the problem of optimizing parallelism placement and reduction strategies for deep learning on hierarchical systems. For 69% of parallelism placements and requested reductions, its synthesized programs outperform the default all-reduce implementation, by up to 2.04x (1.27x on average).

We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mappings. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user-requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10 accuracy, which therefore reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.
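To make the abstract's two ideas concrete, here is a minimal Python sketch. It is not the paper's synthesis framework, and all names in it (`reduction_groups`, `nodes_spanned`, `hierarchical_all_reduce`) are hypothetical. It assumes a two-level hierarchy (nodes containing GPUs) and row-major device numbering over the parallelism axes. The first part shows how the placement of axes decides whether a reduction over the data axis crosses node boundaries. The second part decomposes an all-reduce into an intra-node reduce, a cross-node all-reduce, and an intra-node broadcast, simulated on plain numbers rather than real collectives.

```python
from itertools import product


def reduction_groups(axis_sizes, order, reduce_axis):
    """Group device ids that must be reduced together over `reduce_axis`.

    Assumption: device ids are the row-major index over `order`
    (outermost axis first), so the last axis in `order` varies fastest.
    """
    groups = {}
    ranges = [range(axis_sizes[a]) for a in order]
    for device_id, coords in enumerate(product(*ranges)):
        # Devices sharing every coordinate except the reduced one form a group.
        key = tuple(c for a, c in zip(order, coords) if a != reduce_axis)
        groups.setdefault(key, []).append(device_id)
    return list(groups.values())


def nodes_spanned(group, gpus_per_node):
    """Number of distinct nodes a reduction group touches."""
    return len({d // gpus_per_node for d in group})


def hierarchical_all_reduce(values, group, gpus_per_node):
    """Simulate a hierarchy-aware all-reduce (sum) over one group.

    `values` maps device id -> number; a new mapping is returned.
    """
    by_node = {}
    for d in group:
        by_node.setdefault(d // gpus_per_node, []).append(d)
    # Step 1: reduce inside each node (fast local links).
    partial = {n: sum(values[d] for d in devs) for n, devs in by_node.items()}
    # Step 2: all-reduce the per-node partial sums across nodes (slow links).
    total = sum(partial.values())
    # Step 3: broadcast the result inside each node.
    out = dict(values)
    for d in group:
        out[d] = total
    return out


axes = {"data": 4, "model": 2}   # 8 devices in total
gpus_per_node = 4                # assumed hierarchy: 2 nodes x 4 GPUs

# Placement A: model axis innermost, so data-parallel peers are strided.
groups_a = reduction_groups(axes, ("data", "model"), "data")
# Placement B: data axis innermost, so data-parallel peers are contiguous.
groups_b = reduction_groups(axes, ("model", "data"), "data")

print(groups_a, [nodes_spanned(g, gpus_per_node) for g in groups_a])
print(groups_b, [nodes_spanned(g, gpus_per_node) for g in groups_b])

values = {d: d + 1 for d in range(8)}
print(hierarchical_all_reduce(values, groups_a[0], gpus_per_node))
```

Under placement A every data-axis group spans both nodes, while under placement B each group stays inside one node, so the same logical reduction takes very different communication paths. A synthesis tool in the spirit of the abstract would search over such decompositions instead of hard-coding the three steps used here.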
