Asymptotics of SGD in Sequence-Single Index Models and Single-Layer Attention Networks
This work provides a rigorous foundation for understanding how sequential structure benefits learning in attention-based models, although the contribution is incremental in that it builds on classical single-index models.
The paper studies the dynamics of stochastic gradient descent (SGD) for Sequence Single-Index (SSI) models, which generalize single-index models to sequences and include simplified attention networks. It derives a closed-form expression for the population loss and characterizes the SGD dynamics, revealing two training phases and showing how sequence length and positional encoding influence convergence.
We study the dynamics of stochastic gradient descent (SGD) for a class of sequence models termed Sequence Single-Index (SSI) models, where the target depends on a single direction in input space applied to a sequence of tokens. This setting generalizes classical single-index models to the sequential domain, encompassing simplified one-layer attention architectures. We derive a closed-form expression for the population loss in terms of a pair of sufficient statistics capturing semantic and positional alignment, and characterize the induced high-dimensional SGD dynamics for these coordinates. Our analysis reveals two distinct training phases: escape from uninformative initialization and alignment with the target subspace, and demonstrates how the sequence length and positional encoding influence convergence speed and learning trajectories. These results provide a rigorous and interpretable foundation for understanding how sequential structure in data can be beneficial for learning with attention-based models.
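To make the setting concrete, the following is a minimal sketch of online SGD on a toy sequence single-index task. It is not the paper's model: the tanh link, the averaging over tokens, the absence of positional encoding and attention, the spherical constraint, the `lr / d` step size, and the name `run_online_sgd` are all assumptions chosen for illustration. The sketch tracks only the overlap between the student direction and a hypothetical teacher direction `w_star`, which plays the role of a single alignment statistic.

```python
import numpy as np


def run_online_sgd(d=64, seq_len=4, steps=4000, lr=0.5, seed=0):
    """Online spherical SGD on a toy sequence single-index task (illustrative only).

    Returns the overlap w . w_star after every step (array of length steps + 1).
    """
    rng = np.random.default_rng(seed)

    # Hypothetical teacher direction and random student start, both unit norm.
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)

    overlaps = np.empty(steps + 1)
    overlaps[0] = w @ w_star

    for t in range(steps):
        # One fresh sequence per step: seq_len Gaussian tokens of dimension d.
        X = rng.standard_normal((seq_len, d))

        # Assumed target: tanh link applied along w_star, averaged over tokens.
        y = np.tanh(X @ w_star).mean()

        # The student uses the same functional form with its own direction w.
        act = np.tanh(X @ w)
        err = act.mean() - y

        # Gradient of 0.5 * err**2 with respect to w.
        grad = err * ((1.0 - act**2) @ X) / seq_len

        # Remove the radial component, step with size lr / d, then renormalize.
        grad -= (grad @ w) * w
        w = w - (lr / d) * grad
        w /= np.linalg.norm(w)

        overlaps[t + 1] = w @ w_star

    return overlaps


if __name__ == "__main__":
    ov = run_online_sgd()
    print("initial overlap:", ov[0])
    print("final overlap:", ov[-1])
```

Plotting the returned array against the step index for several values of `seq_len` gives a rough picture of how the overlap evolves in this toy setting. Because the link and architecture here are simplified assumptions, the curve should not be read as a reproduction of the two training phases described above.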