LG | CL | CV | Oct 13, 2021

Leveraging redundancy in attention with Reuse Transformers

arXiv:2110.06821v1 | 43 citations
Originality: Incremental advance
AI Analysis

This paper addresses efficiency issues in Transformers for language and vision applications, offering a method that reduces computational cost without sacrificing performance.

The paper tackles the redundancy of attention scores across Transformer layers by proposing Reuse Transformers, which reuse the attention scores computed in one layer in subsequent layers. On standard benchmarks this achieves equivalent or better performance while reducing compute and memory usage.

Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads in multiple layers. We systematically analyze the empirical similarity of these scores across heads and layers and find them to be considerably redundant, with adjacent layers showing especially high similarity. Motivated by these findings, we propose a novel architecture that reuses attention scores computed in one layer in multiple subsequent layers. Experiments on a number of standard benchmarks show that reusing attention delivers performance equivalent to or better than standard Transformers, while reducing both compute and memory usage.
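The idea of reusing attention scores across layers can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, reuses all of the scores rather than a subset of heads, and omits layer normalization and the feed-forward block. The names `attention_scores`, `apply_attention`, `reuse_stack`, and the `reuse_every` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_scores(x, w_q, w_k):
    # Pairwise dot-product attention: softmax(Q K^T / sqrt(d)).
    q = x @ w_q
    k = x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def apply_attention(x, scores, w_v):
    # Mix the value vectors using an already computed score matrix.
    return scores @ (x @ w_v)


def reuse_stack(x, layers, reuse_every=2):
    # Assumption for illustration: scores are recomputed every
    # `reuse_every` layers and reused unchanged in the layers between.
    scores = None
    n_computed = 0
    for i, layer in enumerate(layers):
        if i % reuse_every == 0:
            scores = attention_scores(x, layer["w_q"], layer["w_k"])
            n_computed += 1
        # Residual connection around the attention output.
        x = x + apply_attention(x, scores, layer["w_v"])
    return x, n_computed


rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 5, 8, 4
layers = [
    {name: rng.normal(scale=0.1, size=(d_model, d_model))
     for name in ("w_q", "w_k", "w_v")}
    for _ in range(n_layers)
]
x = rng.normal(size=(n_tokens, d_model))
out, n_computed = reuse_stack(x, layers, reuse_every=2)
print(out.shape, n_computed)
```

In this sketch the layers that reuse scores skip the query and key projections and the n-by-n score computation entirely, which is where savings in compute and memory would come from. Here four layers need only two score computations.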

Code Implementations: 1 repo
