ML, LG · Jun 3

Flatness and Generalization: Learning Multi-Index Models with Homogeneous Neural Networks

Harsh Vardhan, Hossein Taheri, Arya Mazumdar

arXiv:2606.04429

AI Analysis

The paper addresses a long-standing debate about flatness heuristics for neural network generalization by proving a direct link between flatness and generalization for a realistic class of models and activations.

The paper establishes a connection between flatness (measured by the trace of the Hessian) and generalization for 2-layer homogeneous neural networks learning multi-index models. It shows that the flattest interpolators generalize well when approximation error and label noise are both low, while a class of non-generalizing interpolators cannot be brought close to the flattest possible, even via symmetries.

A common heuristic used to explain the generalization of first-order gradient methods on non-convex neural networks is that "flat interpolators generalize well" (Hochreiter and Schmidhuber, 1994; Keskar et al., 2017), where flatness can be measured by the trace of the Hessian of the empirical loss. However, Dinh et al. (2017) showed that any interpolator can be made sharper or flatter by exploiting symmetries of the network, which change flatness while keeping the population and empirical losses unchanged. This result makes the earlier heuristic statement vacuous. In this paper, we show that for learning an unknown multi-index model with $2$-layer non-convex homogeneous neural networks, there is a connection between flatness and generalization, despite the existence of symmetries. This connection pertains to the "flattest" interpolators, i.e., the interpolators that have orderwise minimum flatness among all interpolators. First, we show that there exists a natural class of non-generalizing interpolators whose flatness cannot be brought close to the flattest possible, even using symmetries. Second, we show that for data generated by a sum of single-index models, if the approximation error and label noise are low, any flattest interpolator achieves small population loss, i.e., the flattest interpolators always generalize. This establishes a direct link between flatness and generalization which applies to a large class of activations and realistic data distributions.
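
The symmetry argument attributed to Dinh et al. (2017) can be illustrated numerically. The sketch below is not taken from the paper: it assumes a ReLU activation, a squared loss of the form (1/2n) Σ (f(x_i) − y_i)², and random Gaussian data and weights, and every function name in it is hypothetical. Under those assumptions the second derivatives of the network with respect to each individual parameter vanish almost everywhere, so the Hessian trace reduces to the mean squared norm of the parameter gradient of the network output. Rescaling the first layer by α and the second layer by 1/α leaves the network function unchanged while the trace changes.

```python
import numpy as np

def forward(W, a, X):
    # Two-layer ReLU network: f(x) = sum_j a_j * relu(w_j . x)
    return np.maximum(X @ W.T, 0.0) @ a

def hessian_trace(W, a, X):
    # Trace of the Hessian of (1/2n) * sum_i (f(x_i) - y_i)^2 w.r.t. (W, a).
    # For ReLU, d^2 f / d(param)^2 = 0 almost everywhere for every single
    # parameter, so the trace equals mean_i ||grad_params f(x_i)||^2 and
    # does not depend on the labels.
    pre = X @ W.T                                   # (n, m) pre-activations
    act = np.maximum(pre, 0.0)                      # df/da_j = relu(w_j . x)
    gate = (pre > 0).astype(float)                  # relu'(w_j . x)
    grad_a_sq = (act ** 2).sum(axis=1)
    # df/dw_j = a_j * relu'(w_j . x) * x, so its squared norm is
    # a_j^2 * gate * ||x||^2
    grad_W_sq = (gate * a ** 2).sum(axis=1) * (X ** 2).sum(axis=1)
    return float(np.mean(grad_a_sq + grad_W_sq))

def rescale(W, a, alpha):
    # Positive rescaling symmetry of a homogeneous network:
    # relu(alpha * z) = alpha * relu(z) for alpha > 0.
    return alpha * W, a / alpha

# Illustrative random data and weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
W = rng.normal(size=(8, 5))
a = rng.normal(size=8)

for alpha in (0.5, 1.0, 2.0, 10.0):
    W2, a2 = rescale(W, a, alpha)
    same_function = np.allclose(forward(W, a, X), forward(W2, a2, X))
    print(alpha, same_function, hessian_trace(W2, a2, X))
```

In this setup the trace along the rescaling orbit has the form α²·T₁ + T₂/α², where T₁ and T₂ are the two gradient terms at α = 1, so it grows without bound as α moves away from its minimizer but is bounded below by 2·sqrt(T₁·T₂). That lower bound is one way to see why statements about the flattest interpolators can remain meaningful even though the symmetry lets individual interpolators be made arbitrarily sharp.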
