LG | DIS-NN | NE | ML | May 26, 2025

Recurrent Self-Attention Dynamics: An Energy-Agnostic Perspective from Jacobians

arXiv:2505.19458v4 | 4 citations | h-index: 2
Originality: Incremental advance
AI Analysis

The paper provides a theoretical foundation for analyzing general self-attention architectures. The advance is incremental but useful for researchers in deep learning and dynamical systems.

The paper studies self-attention dynamics without relying on energy-based assumptions. Analyzing the Jacobian matrix shows that the normalization layer suppresses both the Lipschitzness of self-attention and the Jacobian's complex eigenvalues, and that criticality, as measured by Lyapunov exponents, correlates with high inference performance.
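For orientation, the quantities named above have standard dynamical-systems definitions. The notation below is assumed for illustration and is not taken from the paper: $x_t$ denotes the state (the token representations), $f$ one recurrent self-attention update, and $J_t$ its Jacobian at step $t$.

```latex
\begin{align}
x_{t+1} &= f(x_t), \qquad J_t = \frac{\partial f}{\partial x}\bigg|_{x = x_t}, \\
\|f(x) - f(y)\|_2 &\le L\,\|x - y\|_2, \qquad L = \sup_{x} \|J(x)\|_2 \quad \text{(on a convex domain)}, \\
\lambda_{\max} &= \lim_{T \to \infty} \frac{1}{T} \log \big\| J_{T-1} \cdots J_1 J_0 \, v \big\|_2 \quad \text{for a generic unit vector } v.
\end{align}
```

A complex-conjugate eigenvalue pair of $J_t$ corresponds to a locally rotating, hence oscillatory, component of the dynamics. The case $\lambda_{\max} \approx 0$ marks the critical regime between contraction ($\lambda_{\max} < 0$) and chaotic expansion ($\lambda_{\max} > 0$).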

The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function decreased by state updates. While it provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and provide an energy-agnostic characterization of inference dynamics through dynamical systems analysis. In more detail, we first consider relaxing the symmetry and single-head constraints traditionally required in energy-based formulations. Next, we show that analyzing the Jacobian matrix of the state is highly valuable when investigating more general SA architectures that do not necessarily admit an energy function. This analysis reveals that the normalization layer plays an essential role in suppressing the Lipschitzness of SA and the Jacobian's complex eigenvalues, which correspond to the oscillatory components of the dynamics. In addition, the Lyapunov exponents computed from the Jacobians demonstrate that the normalized dynamics lie close to a critical state, and this criticality serves as a strong indicator of high inference performance. Furthermore, the Jacobian perspective enables us to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.
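The Jacobian-based quantities in the abstract can be illustrated on a toy model. The sketch below is not the paper's architecture or code. It assumes a single-head residual SA update with random weights, an optional layer normalization, and a finite-difference Jacobian, and every function name (`sa_step`, `jacobian`, `lyapunov_spectrum`) is hypothetical. It reports the Jacobian's spectral norm (a local Lipschitz estimate), the largest imaginary part among its eigenvalues (oscillatory components), and the maximum Lyapunov exponent estimated by QR re-orthonormalization.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sa_step(x, Wq, Wk, Wv, normalize=True):
    """One recurrent update: residual single-head SA, optionally normalized (toy model)."""
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(x.shape[-1])
    y = x + softmax(scores) @ (x @ Wv)
    return layer_norm(y) if normalize else y

def jacobian(f, x, h=1e-5):
    """Central finite-difference Jacobian of a state map f w.r.t. the flattened state."""
    x0 = x.ravel().astype(float)
    n = x0.size
    J = np.zeros((n, n))
    for i in range(n):
        step = h * max(1.0, abs(x0[i]))  # relative step keeps precision for large states
        e = np.zeros(n)
        e[i] = step
        fp = f((x0 + e).reshape(x.shape)).ravel()
        fm = f((x0 - e).reshape(x.shape)).ravel()
        J[:, i] = (fp - fm) / (2.0 * step)
    return J

def lyapunov_spectrum(f, x, steps=30):
    """Estimate Lyapunov exponents of x <- f(x) by QR re-orthonormalization."""
    n = x.size
    Q = np.eye(n)
    log_growth = np.zeros(n)
    for _ in range(steps):
        J = jacobian(f, x)
        Q, R = np.linalg.qr(J @ Q)
        log_growth += np.log(np.abs(np.diag(R)) + 1e-300)  # guard against log(0)
        x = f(x)
    return np.sort(log_growth / steps)[::-1]  # descending: first entry is the maximum

# Assumed toy setup: 4 tokens of dimension 3 with random weights.
rng = np.random.default_rng(0)
tokens, dim = 4, 3
x0 = rng.standard_normal((tokens, dim))
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

for normalize in (False, True):
    f = lambda s, norm=normalize: sa_step(s, Wq, Wk, Wv, norm)
    J = jacobian(f, x0)
    eigs = np.linalg.eigvals(J)
    print("normalized" if normalize else "unnormalized",
          "| spectral norm:", np.linalg.norm(J, 2),
          "| largest |Im(eig)|:", np.abs(eigs.imag).max(),
          "| max Lyapunov exponent:", lyapunov_spectrum(f, x0)[0])
```

With normalization, the smallest exponents in this sketch are strongly negative because layer normalization is invariant to per-token shifts and (nearly) to per-token rescaling, so the maximum exponent is the informative one. Whether this toy model reproduces the paper's findings depends on the weights and is not guaranteed.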

