A Probabilistic Interpretation of Transformers
This work offers a theoretical foundation for understanding transformers that may benefit researchers in machine learning and AI. The contribution is incremental, since it builds on prior theories such as the Hopfield theory of attention.
The authors interpret transformer attention mechanisms through a probabilistic framework based on exponential families. They show that the attention sublayer corresponds to a gradient ascent step on the log normalizer, and they identify theoretical limitations of both this framework and existing theories.
We propose a probabilistic interpretation, based on exponential families, of the exponential dot-product attention of transformers and of contrastive learning. The attention sublayer of transformers is equivalent to a gradient ascent step on the log normalizer, which is the log-sum-exp term in the Hopfield theory of attention. This ascent step induces a parallel expansion of points, which is counterbalanced by a contraction from layer normalization. We also state theoretical limitations of both our theory and the Hopfield theory, and suggest directions for resolving them.
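The correspondence between attention and the gradient of the log normalizer can be checked numerically. The following is a minimal sketch, not the authors' code, under simplifying assumptions that are ours: values are tied to keys, there are no learned projections and no 1/sqrt(d) scaling, and the residual connection plays the role of a step of size 1. All function names are hypothetical.

```python
import numpy as np

def log_normalizer(q, K):
    """Log-sum-exp of the dot products between a query q and the rows of K."""
    s = K @ q
    m = s.max()  # subtract the max for numerical stability
    return m + np.log(np.exp(s - m).sum())

def attention(q, K):
    """Softmax attention with values tied to keys (an assumption of this sketch)."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()  # softmax weights over the keys
    return w @ K  # weighted average of the keys

def numerical_grad(f, q, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at q."""
    g = np.zeros_like(q)
    for i in range(q.size):
        e = np.zeros_like(q)
        e[i] = eps
        g[i] = (f(q + e) - f(q - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 3))  # five keys in three dimensions (arbitrary sizes)
q = rng.normal(size=3)       # one query

grad = numerical_grad(lambda x: log_normalizer(x, K), q)
attn = attention(q, K)
max_abs_diff = np.max(np.abs(grad - attn))

# Residual update, read as one gradient ascent step with step size 1.
q_next = q + attn
```

Under these assumptions `max_abs_diff` is at the level of finite-difference error, so the attention output matches the gradient of the log-sum-exp term. Because log-sum-exp is convex, the update `q_next` cannot decrease the log normalizer. The sketch does not model layer normalization or the expansion and contraction behaviour described above.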