AR · LG · Jun 28, 2024

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

arXiv:2406.19580v2 · 1 citation
Originality: Incremental advance
AI Analysis

FRED targets the communication bottlenecks faced by researchers and engineers who use wafer-scale systems to accelerate distributed DNN training. It is an incremental advance in interconnect design.

The paper addresses inefficient communication in wafer-scale distributed DNN training by proposing FRED, a flexible interconnect that supports multiple parallelization strategies and in-switch collective communication. Compared with a baseline 2D-Mesh fabric, FRED improves average end-to-end training time by 1.34X to 1.87X on models such as ResNet-152 and GPT-3.

Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks across multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making them an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-bandwidth requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively, when compared to a baseline wafer-scale 2D-Mesh fabric.
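The "approximately 2X" traffic reduction from in-switch collectives can be motivated with a standard bandwidth-cost model. The sketch below is not taken from the paper. It assumes a ring all-reduce as the endpoint-based baseline (each node injects 2(N-1)/N times the message size) and an idealized switch that reduces in the network (each node injects the message once). The function names are hypothetical, and the paper's own traffic accounting may differ.

```python
def ring_allreduce_bytes_sent(n_nodes: int, msg_bytes: float) -> float:
    """Bytes each node injects in a ring all-reduce.

    Reduce-scatter and all-gather each move (N-1)/N of the message per node.
    """
    if n_nodes < 2:
        raise ValueError("an all-reduce needs at least two nodes")
    return 2 * (n_nodes - 1) / n_nodes * msg_bytes


def in_switch_allreduce_bytes_sent(n_nodes: int, msg_bytes: float) -> float:
    """Bytes each node injects when the switch performs the reduction.

    Assumption: an idealized switch, so each node sends its data exactly once
    and receives the reduced result back.
    """
    if n_nodes < 2:
        raise ValueError("an all-reduce needs at least two nodes")
    return float(msg_bytes)


def traffic_ratio(n_nodes: int, msg_bytes: float = 1.0) -> float:
    """Injected traffic of the ring baseline relative to in-switch reduction."""
    return (ring_allreduce_bytes_sent(n_nodes, msg_bytes)
            / in_switch_allreduce_bytes_sent(n_nodes, msg_bytes))


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(f"N={n:3d}  ring / in-switch = {traffic_ratio(n):.3f}")
```

Under these assumptions the ratio is 2(N-1)/N, which approaches 2 as the number of accelerators grows. That is consistent with the roughly 2X figure quoted in the abstract, though it does not reproduce the paper's measurement.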
