LG · OC · Apr 25, 2025

Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

arXiv:2504.18208v2 · 4 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of obtaining convergence guarantees for high-dimensional neural network training and offers a method for learning feature distributions more efficiently. It appears incremental, however, because it builds on separable nonlinear least squares and existing PDE results.

The paper tackles the non-convex optimization problem of training mean-field single-hidden-layer neural networks. It proposes a Variable Projection algorithm that eliminates the linear variables, reducing the problem to training the nonlinear features. In a teacher-student scenario, this yields provable convergence rates for sampling the teacher feature distribution, with the dynamics of the feature distribution described by a weighted ultra-fast diffusion equation.

We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis where nonlinear representations of the data are fixed. Using the fact that this problem belongs to the class of separable nonlinear least squares problems, we consider here a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of nonlinear features. In a teacher-student scenario, we show that such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamics of the feature distribution correspond to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.
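The Variable Projection idea described in the abstract can be illustrated with a small finite-width sketch. This is not the paper's implementation: the tanh activation, the 1/m mean-field normalization, the ridge penalty `lam`, and all function names below are assumptions chosen for illustration. For fixed features `W`, the linear weights `u` are obtained in closed form by ridge regression (the fast timescale); the features are then updated by a gradient step on the resulting reduced objective (the slow timescale). Because `u` exactly minimizes the inner problem, the gradient of the reduced objective with respect to `W` equals the partial gradient evaluated at that minimizer.

```python
import numpy as np


def features(W, X):
    """Feature matrix with mean-field scaling: Phi[i, j] = tanh(x_i . w_j) / m."""
    return np.tanh(X @ W.T) / W.shape[0]


def solve_outer(W, X, y, lam):
    """Fast timescale: ridge regression for the linear weights u, given features W.

    Minimizes (1/2n) * ||Phi u - y||^2 + (lam/2) * ||u||^2 in closed form.
    """
    Phi = features(W, X)
    n, m = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y / n)


def reduced_loss(W, X, y, lam):
    """Objective in W alone, after the linear weights have been eliminated."""
    u = solve_outer(W, X, y, lam)
    r = features(W, X) @ u - y
    return 0.5 * np.mean(r ** 2) + 0.5 * lam * (u @ u)


def reduced_grad(W, X, y, lam):
    """Gradient of reduced_loss with respect to W.

    Since u is the exact inner minimizer, only the partial derivative
    with respect to W (at fixed u) contributes.
    """
    n, m = X.shape[0], W.shape[0]
    u = solve_outer(W, X, y, lam)
    T = np.tanh(X @ W.T)                      # shape (n, m)
    r = T @ u / m - y                         # residuals, shape (n,)
    # S[i, j] = r_i * (1 - tanh^2(x_i . w_j)) * u_j / m
    S = r[:, None] * (1.0 - T ** 2) * (u[None, :] / m)
    return S.T @ X / n                        # shape (m, d)


def varpro_step(W, X, y, lam, step):
    """Slow timescale: one gradient-descent step on the features."""
    return W - step * reduced_grad(W, X, y, lam)
```

In a teacher-student experiment one would generate `y` from a fixed teacher network, initialize the student features `W` at random, and iterate `varpro_step`. The abstract's analysis concerns the limit where the regularization strength vanishes; this sketch keeps `lam` strictly positive so that the inner linear system stays well conditioned.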

