DCAI · Nov 26, 2023

Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search

Microsoft
arXiv:2311.15269v1 · 9 citations · h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks in distributed deep learning that affect both researchers and practitioners, though it is an incremental advance because it builds on existing scheduling methods.

The paper tackles the challenge of optimizing distributed execution of large DNN models by automating schedule search for diverse operator placement strategies, achieving up to 5.5x training speedup and 38% inference latency reduction.

Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit a repetitive pattern (repetend) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5x training performance speedup and up to 38% inference latency reduction.
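To make the repetend idea concrete, here is a toy sketch in Python. It is not Tessel's search algorithm, which the abstract does not spell out; it only illustrates how a short repeating pattern can be unrolled into a full schedule whose leading and trailing partial steps act as warm-up and cool-down. The two-stage forward-only pipeline, the stage labels `F0` and `F1`, and the helper names `unroll` and `respects_dependencies` are all assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# One repetend step: device id -> (stage label, micro-batch offset).
# Hypothetical encoding chosen for this sketch, not taken from the paper.
Repetend = Dict[int, Tuple[str, int]]
# One concrete schedule step: device id -> (stage label, micro-batch index).
Step = Dict[int, Tuple[str, int]]


def unroll(repetend: Repetend, num_microbatches: int) -> List[Step]:
    """Repeat the repetend over all micro-batches, dropping out-of-range work.

    The partially filled steps at the start and end play the role of
    warm-up and cool-down around the steady-state repetition.
    """
    offsets = [off for _, off in repetend.values()]
    schedule: List[Step] = []
    for i in range(-max(offsets), num_microbatches - min(offsets)):
        step = {
            dev: (stage, i + off)
            for dev, (stage, off) in repetend.items()
            if 0 <= i + off < num_microbatches
        }
        if step:
            schedule.append(step)
    return schedule


def respects_dependencies(schedule: List[Step], stage_order: List[str]) -> bool:
    """Check that each stage runs strictly after the previous stage
    on the same micro-batch."""
    ran_at = {}
    for t, step in enumerate(schedule):
        for stage, mb in step.values():
            ran_at[(stage, mb)] = t
    for (stage, mb), t in ran_at.items():
        k = stage_order.index(stage)
        # A missing predecessor defaults to t, which fails the check.
        if k > 0 and ran_at.get((stage_order[k - 1], mb), t) >= t:
            return False
    return True


# Device 0 runs stage F0 one micro-batch ahead of device 1 running F1.
repetend = {0: ("F0", 1), 1: ("F1", 0)}
schedule = unroll(repetend, num_microbatches=4)
valid = respects_dependencies(schedule, ["F0", "F1"])

# A repetend that runs both stages on the same micro-batch in one step
# violates the F0 -> F1 dependency.
bad_schedule = unroll({0: ("F0", 0), 1: ("F1", 0)}, num_microbatches=4)
bad_valid = respects_dependencies(bad_schedule, ["F0", "F1"])
```

With four micro-batches the unrolled schedule has five steps: one warm-up step where only device 0 is busy, three full repetitions of the repetend, and one cool-down step where only device 1 is busy. Searching over a small pattern like `repetend` and then completing the schedule around it is the kind of cost reduction the two-phase approach aims for, under the simplifying assumptions stated above.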

