CL · Jan 3, 2021

An Efficient Transformer Decoder with Compressed Sub-layers

arXiv:2101.00542v4 · 32 citations
Originality: Highly original
AI Analysis

This work provides a more efficient Transformer decoder, which benefits researchers and practitioners working with large language models and machine translation: it reports a 1.42x speedup over a strong baseline with no loss in performance.

This paper addresses the computational inefficiency of the Transformer decoder by simplifying its architecture. The authors propose the Compressed Attention Network, which achieves a 1.42x speedup on 14 WMT machine translation tasks while keeping performance on par with a strong baseline.

The large attention-based encoder-decoder network (Transformer) has recently become prevalent due to its effectiveness, but the high computational complexity of its decoder makes it inefficient. By examining the mathematical formulation of the decoder, we show that under some mild conditions the architecture can be simplified by compressing its sub-layers, the basic building blocks of the Transformer, thereby achieving higher parallelism. We therefore propose the Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster, with performance on par with a strong baseline. This strong baseline is itself already 2x faster than the widely used standard baseline, without loss in performance.
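
The two speedup figures in the abstract are stated against different baselines, so it can help to work out what they imply together. The short sketch below is only illustrative arithmetic: it assumes the two factors compose multiplicatively (that is, both were measured under comparable conditions), which the abstract does not state explicitly. The function name `combined_speedup` is hypothetical.

```python
def combined_speedup(*factors: float) -> float:
    """Multiply speedup factors measured against successive baselines.

    Assumption: the factors are independent and compose multiplicatively.
    """
    total = 1.0
    for factor in factors:
        total *= factor
    return total


# Figures quoted in the abstract.
strong_vs_standard = 2.0  # strong baseline relative to the standard baseline
can_vs_strong = 1.42      # Compressed Attention Network relative to the strong baseline

overall = combined_speedup(strong_vs_standard, can_vs_strong)
print(f"Implied speedup over the standard baseline: {overall:.2f}x")
```

Under that assumption, the proposed model would be roughly 2.84x faster than the standard baseline.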
