Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Aditya Varre, Mark Rofin, Nicolas Flammarion

arXiv:2603.06248v1 · 11.5 · h-index: 16

Predicted impact: top 26% in LG · last 90 days · Originality: Incremental advance

AI Analysis

This work provides theoretical insights into the training dynamics of transformers, explaining empirical phenomena such as attention sinks and massive activations for researchers and practitioners working with these models.

This paper analyzes the gradient flow dynamics of the value-softmax model, a core component of self-attention, and finds that gradient flow on this model inherently drives optimization toward low-entropy softmax outputs. This polarizing effect holds across different objectives, including the logistic and square losses.

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V}\sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformers' training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
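To make the value-softmax model concrete, the following is a minimal sketch (not the authors' code) that runs small-step gradient descent, used here as a discrete stand-in for gradient flow, on $L(\mathbf{V}\sigma(\mathbf{a}))$ with the square loss, and records the entropy of $\sigma(\mathbf{a})$ along the way. The target `y`, the step size `lr`, the dimensions, and all function names are assumptions chosen for illustration. The gradients follow from the chain rule: with $\mathbf{g} = \nabla L(\mathbf{z})$ at $\mathbf{z} = \mathbf{V}\sigma(\mathbf{a})$, the gradient in $\mathbf{V}$ is $\mathbf{g}\,\sigma(\mathbf{a})^\top$ and the gradient in $\mathbf{a}$ is $(\mathrm{diag}(\sigma) - \sigma\sigma^\top)\mathbf{V}^\top\mathbf{g}$.

```python
import math


def softmax(a):
    """Numerically stable softmax of a list of floats."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    total = sum(e)
    return [x / total for x in e]


def entropy(s):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in s if p > 0.0)


def forward(V, a):
    """Return (s, z) with s = softmax(a) and z = V s."""
    s = softmax(a)
    z = [sum(row[j] * s[j] for j in range(len(s))) for row in V]
    return s, z


def square_loss(z, y):
    """L(z) = 0.5 * ||z - y||^2; the target y is an illustrative assumption."""
    return 0.5 * sum((zi - yi) ** 2 for zi, yi in zip(z, y))


def gradients(V, a, y):
    """Chain-rule gradients of square_loss(V softmax(a), y) w.r.t. V and a."""
    s, z = forward(V, a)
    g = [zi - yi for zi, yi in zip(z, y)]            # dL/dz
    grad_V = [[gi * sj for sj in s] for gi in g]     # g s^T
    u = [sum(V[i][j] * g[i] for i in range(len(g)))  # u = V^T g
         for j in range(len(s))]
    su = sum(sj * uj for sj, uj in zip(s, u))
    grad_a = [sj * (uj - su) for sj, uj in zip(s, u)]  # (diag(s) - s s^T) u
    return grad_V, grad_a


def train(V, a, y, lr=0.05, steps=200):
    """Plain gradient descent; returns final (V, a) and a (loss, entropy) history."""
    V = [row[:] for row in V]
    a = a[:]
    history = []
    for _ in range(steps):
        s, z = forward(V, a)
        history.append((square_loss(z, y), entropy(s)))
        grad_V, grad_a = gradients(V, a, y)
        V = [[v - lr * gv for v, gv in zip(row, grow)]
             for row, grow in zip(V, grad_V)]
        a = [ai - lr * gi for ai, gi in zip(a, grad_a)]
    return V, a, history


if __name__ == "__main__":
    # Hypothetical small instance: 2-dimensional output, 3 softmax coordinates.
    V0 = [[0.1, -0.2, 0.3], [0.0, 0.2, -0.1]]
    a0 = [0.0, 0.0, 0.0]
    y = [1.0, -1.0]
    _, _, hist = train(V0, a0, y)
    for step in (0, len(hist) // 2, len(hist) - 1):
        loss, ent = hist[step]
        print(f"step {step}: loss={loss:.4f}, entropy={ent:.4f}")
```

The sketch only tracks the quantities the abstract talks about; how quickly (or whether) the entropy visibly drops in such a toy run depends on the initialization, step size, and number of steps, and the paper's statements concern the continuous-time gradient flow rather than this discretization.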
