LG · AI · Mar 24

Are Flat Minima an Illusion?

arXiv:2605.05209 · 39.7
Predicted impact: top 58% in LG · last 90 days
Originality: Highly original
AI Analysis

For the deep learning community, this challenges the widely held belief that flat minima cause generalization, offering a reparameterization-invariant alternative that may refocus research on function-space properties.

The paper argues that flat minima are not the true cause of generalization in neural networks, showing that function-preserving reparameterization can inflate the Hessian at a minimum by two orders of magnitude without changing a single prediction. Instead, it proposes 'weakness', the volume of completions compatible with the learned function, as the reparameterization-invariant driver of generalization. In head-to-head experiments on 100 networks, weakness predicts generalization (ρ=+0.374 on MNIST, ρ=+0.384 on Fashion-MNIST) while sharpness anticorrelates on MNIST (ρ=-0.226).
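The reparameterization argument can be illustrated on a toy model. The sketch below is not the paper's experiment: it assumes a one-hidden-unit ReLU model f(x) = w2 · relu(w1 · x), a half mean squared error loss, and the trace of the loss Hessian as a stand-in for sharpness. All names (`predict`, `loss`, `hessian_trace`, `alpha`) are hypothetical and chosen for illustration only.

```python
# Toy illustration (an assumption, not the paper's setup): f(x) = w2 * relu(w1 * x).
# For any alpha > 0, (w1, w2) -> (alpha * w1, w2 / alpha) leaves f unchanged.

def predict(w1, w2, x):
    return w2 * max(0.0, w1 * x)

def loss(w1, w2, data):
    """Half mean squared error over (x, y) pairs."""
    return sum((predict(w1, w2, x) - y) ** 2 for x, y in data) / (2 * len(data))

def hessian_trace(w1, w2, data, h=1e-4):
    """Trace of the loss Hessian, estimated with central second differences."""
    base = loss(w1, w2, data)
    d11 = (loss(w1 + h, w2, data) - 2 * base + loss(w1 - h, w2, data)) / h ** 2
    d22 = (loss(w1, w2 + h, data) - 2 * base + loss(w1, w2 - h, data)) / h ** 2
    return d11 + d22

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
data = [(x, max(0.0, x)) for x in xs]  # targets generated by w1 = w2 = 1

alpha = 15.0  # hypothetical rescaling factor
original = (1.0, 1.0)
rescaled = (alpha * original[0], original[1] / alpha)

# Same function: predictions agree up to floating-point rounding.
max_prediction_gap = max(
    abs(predict(*original, x) - predict(*rescaled, x)) for x in xs
)
# Different geometry: the Hessian trace at the rescaled point is far larger.
sharpness_ratio = hessian_trace(*rescaled, data) / hessian_trace(*original, data)

print(f"largest prediction difference: {max_prediction_gap:.2e}")
print(f"Hessian trace ratio (rescaled / original): {sharpness_ratio:.1f}")
```

For this toy model the trace at the rescaled point is (α² + α⁻²)/2 times the trace at the balanced point, so α = 15 already gives more than a hundredfold inflation while both parameter settings compute the same function.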

Neural networks that land in flat regions of the loss landscape tend to generalise better than those in sharp regions. Sharpness-Aware Minimisation exploits this to improve generalisation. But function-preserving reparameterisation can inflate the Hessian of any minimum by two orders of magnitude without changing a single prediction. If the geometry of weight space can be manufactured from nothing, it cannot be the cause of anything. In other words, flat is simple and simplicity depends on encoding. Here I show that the actual driver is weakness, the volume of completions compatible with the learned function in the learner's embodied language. Weakness is reparameterisation-invariant because it is defined over what the network \emph{does}, not how it is parameterised. I prove weakness is minimax-optimal under exchangeable demands, and that PAC-Bayes bounds work because they correlate with it. On MNIST, the large-batch generalisation advantage \emph{vanishes} as training data grows, from $+1.6\%$ at $n = 2{,}000$ to $+0.02\%$ at $n = 60{,}000$. A quantity whose predictive power depends on how much data you have is not a cause but a confounder. I run head-to-heads on 100 networks with identical architecture and training. On MNIST, weakness predicts generalisation ($\rho = +0.374$, $p = 0.00012$), sharpness anticorrelates ($\rho = -0.226$) and simplicity predicts nothing ($p = 0.848$). On Fashion-MNIST, weakness again predicts generalisation ($\rho = +0.384$, $p = 8.15 \times 10^{-5}$), though simplicity is at least somewhat predictive there. Simplicity is dataset-dependent, whereas weakness is invariant. Flat minima were never the answer.
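A reader may want to sanity-check how the reported correlations relate to their p-values. The sketch below rests on assumptions not stated in the abstract: that each p-value is two-sided and comes from the usual t-approximation for a correlation coefficient, with n = 100 networks and n - 2 degrees of freedom. The helpers `t_pdf` and `two_sided_p` are hypothetical names, and the tail area is integrated numerically with Simpson's rule so that only the standard library is needed.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(rho, n, steps=20000, span=60.0):
    """Two-sided p-value for a correlation rho from n samples.

    Assumption: t = |rho| * sqrt((n - 2) / (1 - rho^2)) follows a
    t distribution with n - 2 degrees of freedom under the null.
    The upper tail is integrated on [t, t + span] with Simpson's rule
    (steps must be even); the mass beyond that range is negligible here.
    """
    df = n - 2
    t = abs(rho) * math.sqrt(df / (1 - rho * rho))
    h = span / steps
    total = t_pdf(t, df) + t_pdf(t + span, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return 2 * total * h / 3

n_networks = 100  # from the abstract: 100 networks per head-to-head
reported = {
    "MNIST weakness": 0.374,
    "MNIST sharpness": -0.226,
    "Fashion-MNIST weakness": 0.384,
}
p_values = {name: two_sided_p(rho, n_networks) for name, rho in reported.items()}

for name, p in p_values.items():
    print(f"{name}: rho = {reported[name]:+.3f}, approx. two-sided p = {p:.2e}")
```

Under these assumptions, the output can be compared directly with the p-values quoted in the abstract. If the paper used a different coefficient or test, the numbers need not match.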

