OC LG | May 3, 2025

A dynamic view of some anomalous phenomena in SGD

arXiv:2505.01751v3 | 2 citations | h-index: 3 | Systems & Control Letters (Print)
Originality: Synthesis-oriented
AI Analysis

This work addresses training dynamics in machine learning. Its contribution is incremental: it applies existing theory to explain known phenomena.

The paper offers a plausible explanation for anomalous phenomena such as double descent and grokking, which arise when over-parametrized neural networks are trained with SGD. The explanation rests on the theory of two-time-scale stochastic approximation.

It has been observed by Belkin et al. that over-parametrized neural networks exhibit a 'double descent' phenomenon. That is, as the model complexity (as reflected in the number of features) increases, the test error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is that of grokking, wherein two regimes of descent are interrupted by a third regime wherein the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena by using the theory of two time scale stochastic approximation, applied to the continuous time limit of the gradient dynamics. This gives a novel perspective for an already well studied theme.
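To make the idea of two time scales concrete, here is a minimal Python sketch of a toy gradient flow with a fast and a slow coordinate. It is not the paper's model: the quadratic loss, the time-scale ratio `eps`, the step size `dt`, and the function name `simulate` are all assumptions chosen for illustration. It shows only how separating time scales splits training into a rapid initial descent followed by a much slower regime. It does not reproduce double descent or grokking.

```python
def simulate(eps=0.01, dt=0.01, steps=100_000, x0=5.0, y0=2.0):
    """Euler-discretized toy gradient flow with two time scales.

    Hypothetical loss (an assumption, not taken from the paper):
        L(x, y) = 0.5 * (x - y)**2 + 0.5 * y**2
    The coordinate x moves at rate 1 (fast); y moves at rate eps (slow).
    """
    x, y = x0, y0
    losses = []
    for _ in range(steps):
        grad_x = x - y            # dL/dx
        grad_y = (y - x) + y      # dL/dy
        # Simultaneous update: x takes a full step, y a step scaled by eps.
        x, y = x - dt * grad_x, y - dt * eps * grad_y
        losses.append(0.5 * (x - y) ** 2 + 0.5 * y ** 2)
    return x, y, losses
```

Calling `x, y, losses = simulate()` and plotting `losses` on a logarithmic time axis should show the fast phase, in which `x` collapses onto `y`, followed by a slow decay of rate roughly `eps` along `x ≈ y`.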

