LGMay 27, 2022

Transformers from an Optimization Perspective

arXiv:2205.13891v240 citationsh-index: 45
Originality Incremental advance
AI Analysis

This work provides an interpretable optimization perspective for Transformers, which could aid understanding and inspire new model designs, though it is incremental in extending similar analyses from simpler models.

The paper tackles the problem of finding an energy function underlying the Transformer model, such that its minimization corresponds to the Transformer forward pass, and demonstrates a close association between energy minimization and self-attention layers for the first time.

Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, it has thus far remained elusive obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the ground-work for new model designs.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes