ML · LG · Feb 18, 2023

Why is parameter averaging beneficial in SGD? An objective smoothing perspective

arXiv:2302.09376v2 · 1 citation · h-index: 20
AI Analysis

This work aims to explain, and help practitioners improve, generalization in deep learning. The contribution is incremental, since it builds on prior smoothing theories.

The paper investigates why parameter averaging in SGD improves generalization. It proves that, in certain problem settings, averaged SGD efficiently optimizes a smoothed objective that avoids sharp local minima, and it shows experimentally that parameter averaging with an appropriate step size significantly improves the performance of SGD.

It is often observed that stochastic gradient descent (SGD) and its variants implicitly select a solution with good generalization performance; such implicit bias is often characterized in terms of the sharpness of the minima. Kleinberg et al. (2018) connected this bias with the smoothing effect of SGD, which eliminates sharp local minima through convolution with the stochastic gradient noise. We follow this line of research and study the commonly used averaged SGD algorithm, which has been empirically observed in Izmailov et al. (2018) to prefer a flat minimum and therefore to achieve better generalization. We prove that in certain problem settings, averaged SGD can efficiently optimize the smoothed objective, which avoids sharp local minima. In experiments, we verify our theory and show that parameter averaging with an appropriate step size indeed leads to significant improvement in the performance of SGD.
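The smoothing idea in the abstract can be illustrated with a small toy sketch. It is not the paper's setting or experiments: the 1-D objective, the constants, and the helper names (`f`, `grad_f`, `smoothed_f`, `argmin_on_grid`, `averaged_sgd`) are all hypothetical and chosen only for illustration. The sketch assumes Gaussian gradient noise and takes the smoothing scale to be the step size times the gradient-noise standard deviation, in the spirit of the Kleinberg et al. (2018) view. It compares the minimizer of the raw objective, the minimizer of the noise-convolved objective, and the average of the SGD iterates.

```python
import math
import random

# Hypothetical 1-D objective: a wide quadratic basin at x = 0 (flat minimum)
# plus a deep, very narrow Gaussian dip at x = 3 (sharp minimum).
DIP_CENTER, DIP_DEPTH, DIP_WIDTH = 3.0, 5.0, 0.05


def f(x):
    dip = DIP_DEPTH * math.exp(-((x - DIP_CENTER) ** 2) / (2 * DIP_WIDTH ** 2))
    return 0.5 * x * x - dip


def grad_f(x):
    dip = DIP_DEPTH * math.exp(-((x - DIP_CENTER) ** 2) / (2 * DIP_WIDTH ** 2))
    return x + dip * (x - DIP_CENTER) / DIP_WIDTH ** 2


def smoothed_f(x, noise_std, n_nodes=801):
    """Approximate E[f(x - z)], z ~ N(0, noise_std^2), by quadrature on a grid."""
    half = 4.0 * noise_std  # truncate the Gaussian at +/- 4 standard deviations
    total, weight_sum = 0.0, 0.0
    for k in range(n_nodes):
        z = -half + 2.0 * half * k / (n_nodes - 1)
        w = math.exp(-z * z / (2 * noise_std ** 2))
        total += w * f(x - z)
        weight_sum += w
    return total / weight_sum


def argmin_on_grid(fn, lo=-2.0, hi=5.0, step=0.01):
    """Brute-force minimizer over an evenly spaced grid (fine for a toy)."""
    n = int(round((hi - lo) / step))
    return min((lo + step * i for i in range(n + 1)), key=fn)


def averaged_sgd(x0, step_size, grad_noise_std, n_steps, seed=0):
    """Constant-step SGD with additive Gaussian gradient noise.

    Returns the last iterate and the uniform average of all iterates.
    """
    rng = random.Random(seed)
    x, running_sum = x0, 0.0
    for _ in range(n_steps):
        noisy_grad = grad_f(x) + rng.gauss(0.0, grad_noise_std)
        x -= step_size * noisy_grad
        running_sum += x
    return x, running_sum / n_steps


STEP_SIZE, GRAD_NOISE_STD = 0.1, 3.0  # assumed values, not from the paper
SMOOTHING_STD = STEP_SIZE * GRAD_NOISE_STD

x_sharp = argmin_on_grid(f)
x_smooth = argmin_on_grid(lambda x: smoothed_f(x, SMOOTHING_STD))
# Start SGD at the sharp minimum on purpose.
last_iterate, averaged_iterate = averaged_sgd(3.0, STEP_SIZE, GRAD_NOISE_STD, 20000)

print("argmin of raw objective:     ", x_sharp)
print("argmin of smoothed objective:", x_smooth)
print("last SGD iterate:            ", last_iterate)
print("averaged SGD iterate:        ", averaged_iterate)
```

In this toy, the raw objective is minimized inside the narrow dip, while the smoothed objective is minimized in the wide basin. The averaged iterate should settle close to that basin even though individual iterates keep fluctuating. The choice of step size matters here because it sets the smoothing scale: with a much smaller `STEP_SIZE`, the dip would survive the convolution.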

