Higher Order Linear Transformer
This work addresses efficiency issues in transformer models for machine learning practitioners, but it is incremental as it builds on existing linear transformer methods.
The paper tackles the computational complexity of attention mechanisms by extending a linear transformer approach to a second-order approximation of softmax normalization, resulting in improved efficiency.
Building on the linear transformer part of the article by Katharopoulos et al., which takes the idea from Shen et al., this work reuses the trick that gives the attention mechanism linear complexity and extends it to a second-order approximation of the softmax normalization.
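To make the idea concrete, here is a minimal NumPy sketch of one way a second-order approximation can be linearized. It assumes the approximation is the Taylor expansion exp(s) ≈ 1 + s + s²/2 applied to the query-key scores, and that attention is non-causal; the paper's exact formulation may differ. The names `phi2`, `second_order_linear_attention`, and `quadratic_reference` are hypothetical, chosen for illustration only.

```python
import numpy as np

def phi2(x):
    """Feature map chosen so that phi2(q) @ phi2(k) == 1 + q@k + (q@k)**2 / 2.

    Assumption: second-order Taylor expansion of exp around 0.
    x has shape (n, d); the result has shape (n, 1 + d + d*d).
    """
    n, d = x.shape
    ones = np.ones((n, 1))
    # Flattened outer product x ⊗ x, scaled so its dot product gives (q@k)**2 / 2.
    outer = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, outer], axis=1)

def second_order_linear_attention(Q, K, V):
    """Attention with cost linear in the sequence length n."""
    fq, fk = phi2(Q), phi2(K)
    kv = fk.T @ V          # (D, d_v): keys and values summarized once
    z = fk.sum(axis=0)     # (D,): normalization summary
    return (fq @ kv) / (fq @ z)[:, None]

def quadratic_reference(Q, K, V):
    """Same weights computed through the explicit n x n score matrix."""
    s = Q @ K.T
    w = 1.0 + s + 0.5 * s**2   # always positive, so the normalization is safe
    return (w @ V) / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))
out = second_order_linear_attention(Q, K, V)
```

Under these assumptions the n × n score matrix is never formed. The price is a feature dimension of 1 + d + d², so the saving only pays off when the sequence length is large relative to d².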