Ordinary Least Squares is a Special Case of Transformer
For machine learning theorists, this work provides a rigorous algebraic connection between classical statistics (OLS) and modern Transformer architectures, clarifying the fundamental nature of attention.
The paper proves that Ordinary Least Squares (OLS) is a special case of a single-layer Linear Transformer, showing that attention can solve OLS in one forward pass. It also reveals a decoupled slow and fast memory mechanism in Transformers and discusses the evolution from linear to exponential memory capacity.
The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through a rigorous algebraic proof, we show that the latter better describes the Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting under which the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the OLS problem in a single forward pass rather than by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, we discuss the evolution from this linear prototype to standard Transformers. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.
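The equivalence claimed above can be checked numerically with a minimal sketch. It is an illustration under stated assumptions, not necessarily the paper's exact construction: the function name `ols_via_linear_attention`, the whitening map `W` built from the eigendecomposition of X^T X, the use of the targets as values, and the synthetic data are all assumptions made here. With keys X W, queries x W, and no softmax, the attention output is x (X^T X)^{-1} X^T y, which is the OLS prediction.

```python
import numpy as np

def ols_via_linear_attention(X, y, x_query):
    """Single linear-attention pass (no softmax) that reproduces OLS predictions.

    Assumed parameter setting (illustrative, not taken from the paper):
    one shared query/key projection W with W @ W.T == inv(X.T @ X).
    """
    # Empirical (unnormalized) covariance and its spectral decomposition.
    cov = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Whitening map W = U diag(lambda^{-1/2}) U^T; requires full column rank.
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    K = X @ W            # keys: one per in-context example
    Q = x_query @ W      # queries: the test inputs
    V = y                # values: the observed targets
    scores = Q @ K.T     # linear attention scores, no softmax
    return scores @ V    # equals x_query @ inv(X.T X) @ X.T @ y

# Synthetic regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta + 0.1 * rng.normal(size=50)
x_query = rng.normal(size=(3, 4))

attn_pred = ols_via_linear_attention(X, y, x_query)

# Reference: ordinary least squares solved directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ols_pred = x_query @ beta_hat

print(np.allclose(attn_pred, ols_pred))
```

The two predictions agree up to floating-point error whenever X has full column rank, since the inverse covariance is absorbed into the query and key projections and no iterative solver is involved.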