CL, LG · May 11, 2021

EL-Attention: Memory Efficient Lossless Attention for Generation

arXiv:2105.04779v2 · 9 citations
AI Analysis

EL-attention addresses the memory and speed bottlenecks that users of Transformer-based models face in generation tasks such as summarization and question generation.

The paper tackles the memory cost of caching intermediate results in Transformer multi-head attention during generation. It proposes EL-attention, which eliminates the need for a key/value cache and speeds up existing models by 1.6x to 5.3x without accuracy loss.

A Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents leveraging a larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids the heavy operations for building multi-head keys and values, so no cache for them is needed. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
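The "expand the query, share the key and value" idea can be illustrated with a small NumPy sketch. This is not the authors' implementation. It is a minimal check of the algebraic identity that makes such a rearrangement lossless, under simplifying assumptions: a single decoding step, no bias terms, no masking, and hypothetical names (`multi_head_attention`, `shared_kv_attention`, `Wq`, `Wk`, `Wv`, `Wo`) chosen for illustration. Since (x Wq)(H Wk)^T = (x Wq Wk^T) H^T, the key projection can be folded into the query, and since softmax(s)(H Wv) = (softmax(s) H) Wv, the value projection can be applied after attending over the shared hidden states H.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multi_head_attention(x, H, Wq, Wk, Wv, Wo):
    # Standard form: builds per-head keys and values from H
    # (these are the tensors a decoder would normally cache).
    heads, d_model, d_head = Wq.shape
    out = np.zeros(d_model)
    for i in range(heads):
        q = x @ Wq[i]                            # (d_head,)
        K = H @ Wk[i]                            # (n, d_head)
        V = H @ Wv[i]                            # (n, d_head)
        w = softmax(q @ K.T / np.sqrt(d_head))   # (n,)
        out += (w @ V) @ Wo[i]                   # sum over heads == concat then project
    return out

def shared_kv_attention(x, H, Wq, Wk, Wv, Wo):
    # Rearranged form (illustrative assumption): the query is expanded to
    # d_model per head, and every head attends over the same H.
    heads, d_model, d_head = Wq.shape
    out = np.zeros(d_model)
    for i in range(heads):
        q_exp = (x @ Wq[i]) @ Wk[i].T            # (d_model,) expanded query
        w = softmax(q_exp @ H.T / np.sqrt(d_head))
        out += ((w @ H) @ Wv[i]) @ Wo[i]         # value projection applied afterwards
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    heads, d_model, d_head, n = 4, 16, 4, 7
    x = rng.normal(size=d_model)                 # current decoder state
    H = rng.normal(size=(n, d_model))            # shared hidden states
    Wq, Wk, Wv = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
    Wo = rng.normal(size=(heads, d_head, d_model))
    a = multi_head_attention(x, H, Wq, Wk, Wv, Wo)
    b = shared_kv_attention(x, H, Wq, Wk, Wv, Wo)
    print(np.allclose(a, b))
```

In the rearranged form only H needs to be kept, rather than separate K and V tensors for every head, which is the kind of saving the abstract describes. How the paper handles biases, batching, and beam search is not covered by this sketch.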

Code Implementations: 1 repo