H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
This work addresses the computational bottleneck in Transformers for long sequences, offering significant efficiency gains for natural language and vision tasks.
The authors tackled the problem of inefficient attention computation in Transformers by proposing a hierarchical attention mechanism with linear runtime and memory complexity. Their method achieved over +6 points average improvement on the Long Range Arena benchmark and set a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer parameters.
We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical for natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on One-Billion Word dataset with 5x fewer model parameters than that of the previous-best Transformer-based models.