LG · AI · Nov 16, 2024

MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

arXiv:2411.10741v1 · 17 citations · h-index: 14 · NIPS
Originality: Highly original
AI Analysis

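This work addresses a key bottleneck in efficient Transformer design: how to replace softmax attention with a linear-complexity model without giving up accuracy. It proposes a new linear attention design, MetaLA, and evaluates it on language, vision, and long-sequence tasks.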

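The paper asks what the optimal linear approximation to softmax attention in Transformers looks like. It proposes MetaLA, which satisfies three theoretical conditions (dynamic memory ability, static approximation ability, and least parameter approximation) and outperforms existing linear models on the Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and the Long-Range Arena (LRA) benchmark.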

Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on the Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and the Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
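The "linear attention form" mentioned in the abstract can be illustrated with a small sketch. It is a generic causal linear attention with a per-step scalar decay, not MetaLA itself: the function names, the scalar `decay` gate, and the absence of any normalization or feature map are all assumptions made for illustration. The sketch computes the same output in two ways, once through an explicit quadratic attention map and once through a recurrent state update whose cost grows linearly with sequence length.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_attention_parallel(q: Matrix, k: Matrix, v: Matrix, decay: Vector) -> Matrix:
    """Causal linear attention via an explicit (quadratic-cost) attention map.

    Hypothetical illustration: decay[j] is a scalar gate applied at step j.
    """
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        row = [0.0] * d_v
        for s in range(t + 1):
            # Weight of position s seen from t: product of the decays applied after s.
            w = 1.0
            for j in range(s + 1, t + 1):
                w *= decay[j]
            score = w * sum(qi * ki for qi, ki in zip(q[t], k[s]))
            for d in range(d_v):
                row[d] += score * v[s][d]
        out.append(row)
    return out


def linear_attention_recurrent(q: Matrix, k: Matrix, v: Matrix, decay: Vector) -> Matrix:
    """Same computation as a recurrence: S_t = decay_t * S_{t-1} + k_t v_t^T, o_t = q_t S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size memory, independent of sequence length
    out = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, decay):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = a_t * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out
```

Calling both functions on the same `q`, `k`, `v`, and `decay` should give matching outputs up to floating-point error. Setting every decay to 1.0 gives a plain running sum with no forgetting, while letting the decay vary per step is one simple way to make the memory input-dependent.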

Code Implementations: 1 repo