Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime
This work addresses the challenge of obtaining convergence guarantees for high-dimensional neural network training and offers a more efficient method for learning feature distributions. The contribution appears incremental, however, since it builds on separable nonlinear least squares and on existing PDE results.
The paper tackles the non-convex optimization problem of training mean-field single-hidden-layer neural networks. It proposes a Variable Projection algorithm that eliminates the linear variables and so reduces the problem to training the nonlinear features. In a teacher-student scenario, this yields provable convergence rates for sampling the teacher feature distribution, with the dynamics of the feature distribution described by a weighted ultra-fast diffusion equation.
We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis where nonlinear representations of the data are fixed. Using the fact that this problem belongs to the class of separable nonlinear least squares problems, we consider here a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of nonlinear features. In a teacher-student scenario, we show that such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamics of the feature distribution correspond to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.
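The variable projection idea described above can be illustrated with a minimal finite-width sketch in NumPy. It is not the paper's implementation: the tanh activation, the ridge form of the regularization, the step-halving rule, the absence of mean-field 1/m scaling, and all function names (`features`, `solve_linear_weights`, `reduced_loss_and_grad`, `varpro_descent`) are assumptions made for illustration. For fixed features W, the outer weights u are obtained in closed form by ridge regression, and gradient descent is then run on the reduced objective L(W) = min_u J(W, u), whose gradient follows from the envelope theorem.

```python
import numpy as np


def features(W, X):
    """Nonlinear features tanh(<w_j, x_i>), shape (n_samples, n_features)."""
    return np.tanh(X @ W.T)


def solve_linear_weights(W, X, y, lam):
    """Inner (fast) problem: ridge regression for the outer weights u, W fixed."""
    Phi = features(W, X)
    n, m = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y / n)


def reduced_loss_and_grad(W, X, y, lam):
    """Reduced objective L(W) = min_u J(W, u) and its gradient in W.

    J(W, u) = ||Phi(W) u - y||^2 / (2 n) + lam * ||u||^2 / 2.
    Since u* minimizes J(W, .), the gradient of L is the partial
    derivative of J in W evaluated at u = u* (envelope theorem).
    """
    n = X.shape[0]
    Phi = features(W, X)
    u = solve_linear_weights(W, X, y, lam)
    r = Phi @ u - y                                   # residual, shape (n,)
    loss = r @ r / (2 * n) + lam * (u @ u) / 2
    dZ = (r[:, None] * u[None, :]) * (1.0 - Phi**2) / n  # dJ/d(X W^T)
    return loss, dZ.T @ X


def varpro_descent(W0, X, y, lam, step=1.0, n_iter=200):
    """Gradient descent on L(W) with step halving so the loss never increases."""
    W = W0.copy()
    loss, grad = reduced_loss_and_grad(W, X, y, lam)
    history = [loss]
    for _ in range(n_iter):
        t = step
        while t > 1e-12:
            W_new = W - t * grad
            new_loss, new_grad = reduced_loss_and_grad(W_new, X, y, lam)
            if new_loss <= loss:                      # accept the step
                W, loss, grad = W_new, new_loss, new_grad
                break
            t /= 2                                    # otherwise shrink it
        history.append(loss)
    return W, np.array(history)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 2, 200
    X = rng.normal(size=(n, d))
    W_teacher = rng.normal(size=(3, d))       # hypothetical teacher features
    y = features(W_teacher, X).mean(axis=1)   # teacher with uniform outer weights
    W0 = rng.normal(size=(20, d))             # student initialization
    W, history = varpro_descent(W0, X, y, lam=1e-2)
    print(f"reduced loss: {history[0]:.4e} -> {history[-1]:.4e}")
```

Solving for u exactly at every step is the discrete analogue of the two-timescale regime, in which the linear variables evolve infinitely faster than the features. The sketch only shows the mechanics of the reduction; it makes no claim about the convergence rates established in the vanishing-regularization limit.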