The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations
This work provides a theoretical foundation for feature learning in deep learning, unifying empirical observations about learned features with rigorous first-order optimality analysis.
The paper addresses how neural networks learn representations by deriving the Features at Convergence Theorem (FACT) from first principles. The FACT agrees more closely with learned features at convergence than the Neural Feature Ansatz (NFA) and captures phenomena such as grokking and phase transitions.
A central challenge in deep learning is to understand how neural networks learn representations. A leading approach is the Neural Feature Ansatz (NFA) (Radhakrishnan et al. 2024), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess that lacks a theoretical basis, so it is unclear when it might fail or how to improve it. In this paper, we take a first-principles approach to understanding why the NFA holds and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures, as the NFA does, essential feature learning phenomena in neural networks such as grokking behavior in modular arithmetic and phase transitions in learning sparse parities. Thus, our results unify theoretical first-order optimality analyses of neural networks with the empirically driven NFA literature, and provide a principled alternative that provably and empirically holds at convergence.
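The abstract does not state the form of the FACT, so the following is only a sketch of how a first-order optimality condition can tie a weight Gram matrix to loss gradients. It assumes a linear model `f(x) = W x`, squared loss, and L2 weight decay with coefficient `lam`; these assumptions and all function names are illustrative and not taken from the text. Under them, setting the gradient of the regularized loss to zero and left-multiplying by `W^T` gives `W^T W = -(1 / (n * lam)) * sum_i grad_x(loss_i) x_i^T`, which the code checks numerically.

```python
import numpy as np


def ridge_weights(X, Y, lam):
    """Closed-form minimizer of (1/2n)||X W^T - Y||_F^2 + (lam/2)||W||_F^2.

    Hypothetical helper: X is (n, d), Y is (n, k), and the result W is (k, d).
    """
    n, d = X.shape
    gram = X.T @ X / n + lam * np.eye(d)
    # Stationarity gives gram @ W.T = X.T @ Y / n, so solve for W.T and transpose.
    return np.linalg.solve(gram, X.T @ Y / n).T


def gradient_input_matrix(W, X, Y, lam):
    """Return -(1 / (n * lam)) * sum_i grad_x(loss_i) x_i^T for squared loss.

    Assumed setting: loss_i = 0.5 * ||W x_i - y_i||^2, so grad_x(loss_i) = W^T r_i
    with residual r_i = W x_i - y_i.
    """
    n = X.shape[0]
    residuals = X @ W.T - Y      # (n, k), row i is r_i
    grads_x = residuals @ W      # (n, d), row i is W^T r_i
    return -(grads_x.T @ X) / (n * lam)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = rng.normal(size=(50, 3))
lam = 0.1

W = ridge_weights(X, Y, lam)
gram_w = W.T @ W
rhs = gradient_input_matrix(W, X, Y, lam)
max_gap = float(np.max(np.abs(gram_w - rhs)))
print("max |W^T W - gradient-input matrix| =", max_gap)
```

The two matrices should agree up to floating-point error because `W` is an exact stationary point of the regularized objective. Away from a stationary point, or without weight decay, this identity has no reason to hold, which is consistent with the abstract's emphasis on convergence.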