MLAIITLGSTNov 19, 2025

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

arXiv:2511.15120v14 citations
Originality Highly original
AI Analysis

This provides a theoretical foundation for understanding representation learning in deep learning, showing optimal efficiency in a canonical setup, though it is incremental as it builds on existing models and methods.

The paper tackles the problem of how neural networks efficiently learn high-dimensional features by proving that a two-layer network trained with gradient descent can agnostically learn a Gaussian Multi-index model with near-zero test error using sample and time complexities that match the information-theoretic limit, specifically O(d) samples and O(d^2) time up to logarithmic factors.

In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes