LG | AI | Dec 17, 2025

How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

arXiv:2512.15115v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This work offers theoretical grounding for sequence architecture design, a core concern in machine learning for both researchers and practitioners. Its contribution is incremental, since it builds on existing models.

The paper unifies diverse sequence modeling architectures, such as Transformers and state space models (SSMs), in a single framework that exposes trade-offs between expressivity and trainability. It proves that, within its multi-head factorized class, representing a linear SSM whose lag operators span a k-dimensional subspace requires, and is achievable with, exactly k heads. It also shows that attention layers admit distance-independent gradient paths, whereas stable linear dynamics attenuate gradients with distance. Together, these results formalize a trade-off between algebraic expressivity and long-range gradient propagation.

Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as single-head attention, are constrained to a low-dimensional operator span and cannot represent certain structured dynamical maps. Second, we prove an Equivalence (Head-Count) Theorem showing that, within our multi-head factorized class, representing a linear SSM whose lag operators span a $k$-dimensional subspace on length-$n$ sequences requires and is achievable with $H=k$ heads. Third, we prove a Gradient Highway Result, showing that attention layers admit inputs with distance-independent gradient paths, whereas stable linear dynamics exhibit distance-dependent gradient attenuation. Together, these results formalize a fundamental trade-off between algebraic expressivity (interaction/operator span) and long-range gradient propagation, providing theoretical grounding for modern sequence architecture design.
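The three results can be illustrated numerically with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy diagonal SSM, the lag-operator form `C @ A^l @ B`, the helper names, and the simplification that only the direct value path of attention is inspected. The sketch counts the dimension of the operator span for a single head versus an SSM, checks how many shared value maps are needed to reproduce the SSM's lag operators, and compares gradient decay with distance.

```python
import numpy as np

def lag_operators(A, B, C, n):
    """Lag operators K_l = C A^l B for l = 0..n-1 (assumed SSM form)."""
    ops, A_pow = [], np.eye(A.shape[0])
    for _ in range(n):
        ops.append(C @ A_pow @ B)
        A_pow = A_pow @ A
    return ops

def singular_values(ops):
    """Singular values of the matrix whose rows are the flattened operators."""
    stacked = np.stack([op.ravel() for op in ops])
    return np.linalg.svd(stacked, compute_uv=False)

def span_dimension(ops, tol=1e-8):
    """Dimension of the linear span of a list of equally shaped matrices."""
    return int(np.sum(singular_values(ops) > tol))

def best_residual(ops, num_heads):
    """Smallest Frobenius error when all operators must lie in the span
    of only `num_heads` shared value maps (Eckart-Young bound)."""
    s = singular_values(ops)
    return float(np.sqrt(np.sum(s[num_heads:] ** 2)))

def ssm_jacobian_norm(A, lag):
    """Spectral norm of d h_i / d h_j = A^(i-j) for h_t = A h_{t-1} + B x_t."""
    return float(np.linalg.norm(np.linalg.matrix_power(A, lag), 2))

def attention_value_path_weight(n, target, logit=20.0):
    """Softmax weight on one key among n when that key has a large logit.
    Only the direct value path alpha_ij * W_V is considered here."""
    scores = np.zeros(n)
    scores[target] = logit
    w = np.exp(scores - scores.max())
    return float(w[target] / w.sum())

# Toy stable SSM with three distinct decay rates (hypothetical values).
A = np.diag([0.9, 0.5, 0.2])
B = np.eye(3)
C = np.eye(3)
lags = lag_operators(A, B, C, n=6)

# Single head: every W_ij(X) is a scalar multiple of one value map W_V.
rng = np.random.default_rng(0)
W_V = rng.standard_normal((3, 3))
single_head = [a * W_V for a in rng.standard_normal(6)]

k = span_dimension(lags)                  # 3 for this toy system
one = span_dimension(single_head)         # 1: the interaction rank gap
err_k = best_residual(lags, k)            # ~0: k heads suffice
err_k_minus_1 = best_residual(lags, k - 1)  # > 0: k - 1 heads fall short

near = attention_value_path_weight(100, target=98)
far = attention_value_path_weight(100, target=0)
decay_1 = ssm_jacobian_norm(A, 1)
decay_50 = ssm_jacobian_norm(A, 50)

print(k, one, err_k, err_k_minus_1)
print(near, far, decay_1, decay_50)
```

In this toy case the attention weight on the chosen key is the same whether the key is 1 or 99 positions away, while the SSM Jacobian norm shrinks geometrically with the lag. The sketch only illustrates the flavor of the claims on one example; it does not reproduce the paper's proofs or its exact definitions.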

