DC | LG | Mar 15, 2024

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

arXiv:2403.10266v5 | 16 citations | h-index: 13
Originality: Highly original
AI Analysis

The paper addresses the large memory footprint and slow speed of multi-dimensional transformers in domains that require long sequences, and it offers a novel solution rather than an incremental improvement.

The paper tackles the problem of scaling multi-dimensional transformers to long sequences by proposing Dynamic Sequence Parallelism (DSP), which dynamically switches the parallel dimension to reduce communication overhead. The reported throughput improvements range from 32.2% to 10x, with less than 25% of the communication volume.

Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the large memory requirements and slow speeds that come with such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension and thereby introduces significant communication overhead. Yet multi-dimensional transformers by nature perform independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequence dimensions according to the computation stage, using an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, with throughput improvements ranging from 32.2% to 10x and less than 25% of the communication volume.
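The idea of switching the sharded dimension between computation stages can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the tensor is a 2-D grid (rows as a temporal dimension, columns as a spatial dimension), "devices" are plain Python lists, and the resharding step is modeled as an all-to-all exchange. The names `shard`, `reshard_rows_to_cols`, and `reshard_cols_to_rows` are hypothetical.

```python
def shard(x, n, dim):
    """Split a 2-D list-of-lists into n equal shards along dim (0 = rows, 1 = columns)."""
    if dim == 0:
        step = len(x) // n
        return [x[i * step:(i + 1) * step] for i in range(n)]
    step = len(x[0]) // n
    return [[row[i * step:(i + 1) * step] for row in x] for i in range(n)]


def reshard_rows_to_cols(shards):
    """Simulated all-to-all: row-sharded blocks become column-sharded blocks."""
    n = len(shards)
    # send[i][j] is the chunk that device i sends to device j (its j-th column slice).
    send = [shard(local, n, dim=1) for local in shards]
    # Device j stacks the chunks it receives from every device along the row dimension.
    return [[row for i in range(n) for row in send[i][j]] for j in range(n)]


def reshard_cols_to_rows(shards):
    """Simulated all-to-all in the opposite direction: column-sharded to row-sharded."""
    n = len(shards)
    # send[i][j] is the chunk that device i sends to device j (its j-th row slice).
    send = [shard(local, n, dim=0) for local in shards]
    # Device j joins the received chunks along the column dimension, row by row.
    return [
        [sum((send[i][j][r] for i in range(n)), []) for r in range(len(send[0][j]))]
        for j in range(n)
    ]


# Toy 4x4 "sequence" with a temporal (row) and a spatial (column) dimension.
x = [[t * 10 + s for s in range(4)] for t in range(4)]
by_rows = shard(x, 2, dim=0)             # each device holds full rows: row-wise work is local
by_cols = reshard_rows_to_cols(by_rows)  # each device holds full columns: column-wise work is local
```

In this toy setup, a stage that operates along columns would run on `by_cols` without further communication, and `reshard_cols_to_rows` restores the row-sharded layout for the next stage. A real multi-device version would replace the list shuffling with a collective all-to-all call.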

Code Implementations: 2 repos