LG · Dec 23, 2025

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

arXiv:2512.20607v1 · 7 citations · h-index: 3
Originality: Highly original
AI Analysis

This paper provides a unifying theoretical explanation for a widely observed phenomenon in neural network training, addressing a foundational gap in the understanding of learning dynamics across architectures.

The authors explain why neural networks learn increasingly complex solutions over time, a phenomenon known as simplicity bias. They develop a theoretical framework based on saddle-to-saddle dynamics that applies to fully-connected, convolutional, and attention-based architectures, and show that gradient descent progressively learns solutions that use more hidden neurons, convolutional kernels, or attention heads.

Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
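The linear-network case in the abstract (solutions of increasing rank) can be illustrated with a small NumPy sketch. This is not the authors' code or experimental setup: the function name `train_two_layer_linear`, the diagonal target with singular values 3, 2, 1, the initialization scale, the learning rate, and the threshold used to count "learned" singular values are all assumptions chosen for illustration. The sketch trains a two-layer linear network from a small random initialization with plain gradient descent and records the loss and the effective rank of the end-to-end map at every step.

```python
import numpy as np


def train_two_layer_linear(target, hidden=5, init_scale=1e-5, lr=0.01,
                           steps=3000, rank_threshold=0.5, seed=0):
    """Gradient descent on 0.5 * ||W2 @ W1 - target||_F^2 from a small init.

    All hyperparameters are illustrative assumptions, not values from the paper.
    Returns the per-step loss and the per-step effective rank of W2 @ W1.
    """
    rng = np.random.default_rng(seed)
    out_dim, in_dim = target.shape
    # Small random initialization places the network near the saddle at zero.
    W1 = init_scale * rng.standard_normal((hidden, in_dim))
    W2 = init_scale * rng.standard_normal((out_dim, hidden))

    losses, ranks = [], []
    for _ in range(steps):
        product = W2 @ W1
        err = product - target
        losses.append(0.5 * np.sum(err ** 2))
        # Effective rank: number of singular values above a fixed threshold.
        sv = np.linalg.svd(product, compute_uv=False)
        ranks.append(int(np.sum(sv > rank_threshold)))
        # Both gradients are computed from the pre-update weights.
        grad_W2 = err @ W1.T
        grad_W1 = W2.T @ err
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
    return np.array(losses), np.array(ranks)


# Hypothetical target map with three well-separated singular values.
target = np.diag([3.0, 2.0, 1.0])
losses, ranks = train_two_layer_linear(target)

# First step at which the effective rank reaches 1, 2, and 3.
first_step = {r: int(np.argmax(ranks >= r)) for r in (1, 2, 3)}
print("first step reaching each rank:", first_step)
print("final loss:", losses[-1])
```

Plotting `losses` against the step index should make the plateaus between rank increases visible. Shrinking `init_scale` is expected to lengthen each plateau, which is one way to probe the dependence on weight initialization that the abstract mentions.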

