DC, AI, LG · May 30, 2021

Tesseract: Parallelize the Tensor Parallelism Efficiently

arXiv:2105.14500v2 · 53 citations
Originality: Incremental advance
AI Analysis

This work addresses the bottleneck of high communication costs and low scaling efficiency in distributed training for large models, offering an incremental improvement over existing tensor parallelism techniques.

The paper tackles the problem of training large deep learning models efficiently with limited GPU memory. It proposes Tesseract, a scalable tensor parallelism method that reduces communication overhead and per-GPU memory requirements, achieving speedups of up to 1.53x in strong scaling and up to 4.0x faster inference in weak scaling compared to previous methods.

As state-of-the-art accuracy improves across tasks, deep learning models are getting significantly larger. However, these large models are extremely difficult to implement because limited GPU memory makes it impossible to fit them into a single GPU or even a single GPU server. It is also necessary to reduce their training time. Previous methods such as Megatron-LM implemented a 1-dimensional (1-D) distributed method that uses multiple GPUs to speed up training. However, these methods have high communication overhead and low scaling efficiency on large-scale clusters. To solve these problems, we propose Tesseract, a highly scalable tensor parallelism method with a novel design. It increases efficiency by reducing communication overhead and lowers the memory required on each GPU. By introducing a new dimension into tensor parallelism, Tesseract greatly increases the memory capacity of tensor parallelism. Concretely, this new dimension further increases the degree of tensor parallelism. Compared to previous 1-D and 2-D methods, Tesseract reduces the communication cost of each layer, resulting in speedups of 1.38x and 1.53x, respectively, with strong scaling. In weak scaling experiments, Tesseract achieves up to 4.0x/1.7x inference speedup and 3.4x/1.7x throughput improvement compared to 1-D/2-D methods, respectively. With Tesseract, we offer a more efficient and scalable way to implement large deep learning models with limited GPU resources.

