Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation
This work provides theoretical insights into the learning dynamics of neural networks for researchers in machine learning theory, although it is incremental in that it builds on existing teacher-student models.
The paper tackles the problem of understanding the generalisation behaviour of wide shallow neural networks near the interpolation threshold, revealing a discontinuous phase transition between a "universal" phase with slow error decay and a "specialisation" phase with faster decay that depends on the weight distribution.
We consider a teacher-student model of supervised learning with a fully-trained two-layer neural network whose width $k$ and input dimension $d$ are large and proportional. We provide an effective theory for approximating the Bayes-optimal generalisation error of the network for any activation function in the regime where the sample size $n$ scales quadratically with the input dimension, i.e., around the interpolation threshold, where the number of trainable parameters $kd+k$ and the number of data points $n$ are comparable. Our analysis tackles generic weight distributions. We uncover a discontinuous phase transition separating a "universal" phase from a "specialisation" phase. In the former, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate $n/d^2$, with the student learning only some non-linear combinations of the teacher weights. In the latter, the error depends on the weight distribution and decays faster due to the alignment of the student with the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is, however, potentially hard to find by practical algorithms.
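To make the scaling regime concrete, the following is a minimal sketch of how data from such a teacher network could be generated. It is an illustration under stated assumptions, not the paper's own setup: the Gaussian inputs and weights, the noiseless labels, the $1/\sqrt{d}$ and $1/\sqrt{k}$ normalisations, the `tanh` activation, and the names `gamma` (for $k/d$) and `alpha` (for $n/d^2$) are all hypothetical choices made here for illustration.

```python
import numpy as np


def make_teacher_data(d=30, gamma=0.5, alpha=1.0, activation=np.tanh, seed=0):
    """Sample a data set from a random two-layer teacher network.

    Assumptions (not taken from the paper): Gaussian inputs and weights,
    noiseless labels, and 1/sqrt(d), 1/sqrt(k) normalisations.
    """
    rng = np.random.default_rng(seed)
    k = max(1, int(round(gamma * d)))   # width proportional to input dimension
    n = int(round(alpha * d**2))        # sample size quadratic in d
    W = rng.standard_normal((k, d))     # teacher first-layer weights
    v = rng.standard_normal(k)          # teacher readout weights
    X = rng.standard_normal((n, d))     # inputs
    hidden = activation(X @ W.T / np.sqrt(d))   # shape (n, k)
    y = hidden @ v / np.sqrt(k)                 # shape (n,)
    return X, y, W, v


X, y, W, v = make_teacher_data()
n_samples, d = X.shape
n_params = W.size + v.size              # kd + k trainable parameters
sampling_rate = n_samples / d**2        # the ratio n / d^2
params_per_sample = n_params / n_samples
```

With these hypothetical defaults, `n_params` and `n_samples` are of the same order, which is the sense in which the regime sits near the interpolation threshold. Replacing the Gaussian draw for `W` with another distribution would be the natural way to explore the weight-distribution dependence described for the specialisation phase.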