LG AI · Jan 26, 2025

StagFormer: Time Staggering Transformer Decoding for Running Layers In Parallel

Dylan Cutler, Arun Kandoor, Nishanth Dikkala, Nikunj Saunshi, Xin Wang, Rina Panigrahy

arXiv:2501.15665v29.42 citations · h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses decoding latency for users of large language models, though it appears incremental: it modifies the existing Transformer structure rather than introducing a new paradigm.

The paper tackles the sequential bottleneck in Transformer decoding by proposing StagFormer, a new architecture that staggers execution along the sequence axis to enable parallelization across model layers, achieving quality-neutral decoding with potential speedup in simulations.

Decoding in a Transformer-based language model is inherently sequential, as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new architecture, StagFormer (Staggered Transformer), which staggers execution along the sequence axis and thereby enables parallelizing the decoding process along the depth of the model. We achieve this by breaking the dependency of the token representation at time step $i$ in layer $l$ upon the representations of tokens up to time step $i$ from layer $l-1$. Instead, we stagger the execution and only allow a dependency on token representations up to time step $i-1$. The later sections of the Transformer still get access to the "rich" representations from the prior section, but only from those token positions which are one time step behind. StagFormer allows different sections of the model to be executed in parallel, yielding a potential speedup in decoding while being quality neutral in our simulations. We also explore many natural extensions of this idea. We present how weight-sharing across the different sections being staggered can be more practical in settings with limited memory. We explore the efficacy of using a bounded window attention to pass information from one section to another, which helps drive further latency gains for some applications. We also explore the scalability of the staggering idea over more than 2 sections of the Transformer. Finally, we show how one can approximate a recurrent model during inference using weight-sharing. This variant can lead to substantial gains in quality for short generations while being neutral in its latency impact.
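
The staggered dependency described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation. It assumes a model split into exactly two sections of equal cost, each on its own device, with no communication overhead. The names `staggered_mask` and `decode_finish_time` are hypothetical. The first function builds the visibility pattern under which the second section at position $i$ may read first-section outputs only up to position $i-1$. The second function tracks when each section can start during token-by-token decoding.

```python
def staggered_mask(seq_len):
    """mask[i][j] is True when section 2 at position i may read
    section 1's output at position j, i.e. only when j <= i - 1.
    (Hypothetical helper; position 0 sees nothing from section 1 here.)"""
    return [[j <= i - 1 for j in range(seq_len)] for i in range(seq_len)]


def decode_finish_time(num_tokens, staggered, section_cost=1.0):
    """Toy schedule: time at which the final section finishes the last step.

    Assumptions (not from the paper): two sections, equal cost per step,
    one device per section, zero communication cost.
    """
    s1_done = []  # finish time of section 1 at each step
    s2_done = []  # finish time of section 2 at each step
    for i in range(num_tokens):
        # The input token for step i is produced by section 2 at step i-1.
        token_ready = s2_done[i - 1] if i > 0 else 0.0
        s1_done.append(token_ready + section_cost)
        if staggered:
            # Section 2 waits only for section 1's output from step i-1.
            s1_dependency = s1_done[i - 1] if i > 0 else 0.0
        else:
            # Standard decoding: section 2 waits for section 1 at step i.
            s1_dependency = s1_done[i]
        s2_start = max(token_ready, s1_dependency)
        s2_done.append(s2_start + section_cost)
    return s2_done[-1]


if __name__ == "__main__":
    for row in staggered_mask(4):
        print(["x" if visible else "." for visible in row])
    print("standard :", decode_finish_time(8, staggered=False))
    print("staggered:", decode_finish_time(8, staggered=True))
```

Under these idealized assumptions the two sections overlap completely, so the staggered schedule finishes in about half the time of the sequential one. Real speedups depend on hardware, communication cost, and how evenly the sections are balanced, which is consistent with the abstract describing the gain as a potential speedup.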
