LG · OC · ML · Oct 12, 2021

Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

arXiv:2110.06256v2
16 citations
Originality: Highly original
AI Analysis

This paper addresses a foundational problem in machine learning optimization theory: it reconciles what is observed in large-scale training with what theoretical models predict, and it offers a new framework for understanding training dynamics.

The paper tackles the disconnect between theoretical analyses of gradient-based algorithms and practical deep neural network training. Using numerical evidence from large-scale setups such as ImageNet + ResNet101 and WT103 + TransformerXL, it shows that the weights do not converge to stationary points even though the training loss stabilizes. To explain this, it takes an ergodic theory perspective and proves that the distribution of weights converges to an approximate invariant measure.

This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
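
The phenomenon described above can be illustrated with a toy example. The sketch below is not the paper's experiment or proof; it assumes a one-dimensional quadratic loss f(w) = 0.5 * w^2, constant-step SGD, and additive Gaussian gradient noise, and all names (`run_noisy_sgd`, `lr`, `noise_std`) are hypothetical. In this setting the iterates keep moving and the gradient magnitude stays bounded away from zero, yet the windowed average loss settles, because the distribution of the weight approaches a stationary one with variance `lr * noise_std**2 / (2 - lr)`.

```python
import random
from statistics import fmean


def run_noisy_sgd(steps=20000, lr=0.1, noise_std=1.0, w0=5.0, seed=0):
    """Constant-step SGD on f(w) = 0.5 * w**2 with Gaussian gradient noise.

    Toy assumption: the stochastic gradient is the true gradient plus
    independent N(0, noise_std**2) noise at every step.
    """
    rng = random.Random(seed)
    w = w0
    losses, grad_norms = [], []
    for _ in range(steps):
        grad = w  # exact gradient of 0.5 * w**2
        w -= lr * (grad + rng.gauss(0.0, noise_std))
        losses.append(0.5 * w * w)
        grad_norms.append(abs(w))  # |f'(w)| at the new iterate
    return losses, grad_norms


lr = 0.1
losses, grad_norms = run_noisy_sgd(lr=lr)

# Compare two disjoint windows well after the initial transient.
mid_loss = fmean(losses[5000:10000])
late_loss = fmean(losses[15000:20000])
late_grad = fmean(grad_norms[15000:20000])

# Expected loss under the stationary distribution of this toy recursion:
# Var(w) = lr * noise_std**2 / (2 - lr), so E[f(w)] = 0.5 * Var(w).
stationary_loss = 0.5 * lr * 1.0 / (2 - lr)

print(f"mean loss, steps 5k-10k : {mid_loss:.4f}")
print(f"mean loss, steps 15k-20k: {late_loss:.4f}")
print(f"stationary prediction   : {stationary_loss:.4f}")
print(f"mean |gradient|, late   : {late_grad:.4f}")
```

The two windowed losses should agree closely with each other and with the stationary prediction, while the mean gradient magnitude remains clearly nonzero. This mirrors, in a much simpler setting, the idea of studying the distribution of weights rather than individual weight trajectories.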

