LGDIS-NNJun 8, 2023

Anti-Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances in Flat Directions

arXiv:2306.05300v33 citationsh-index: 28
Originality Incremental advance
AI Analysis

This addresses a fundamental issue in neural network optimization for researchers and practitioners, revealing that epoch-based training's noise structure may help find flatter minima for better generalization, though it is incremental in refining existing SGD theory.

The paper tackled the assumption of uncorrelated gradient noise in epoch-based Stochastic Gradient Descent (SGD) by showing that the noise is inherently anti-correlated over time, which reduces weight variance in flat directions and leads to a considerable decrease in loss fluctuations.

Stochastic Gradient Descent (SGD) has become a cornerstone of neural network optimization due to its computational efficiency and generalization capabilities. However, the gradient noise introduced by SGD is often assumed to be uncorrelated over time, despite the common practice of epoch-based training where data is sampled without replacement. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum. Our main contributions are twofold: First, we calculate the exact autocorrelation of the noise during epoch-based training under the assumption that the noise is independent of small fluctuations in the weight vector, revealing that SGD noise is inherently anti-correlated over time. Second, we explore the influence of these anti-correlations on the variance of weight fluctuations. We find that for directions with curvature of the loss greater than a hyperparameter-dependent crossover value, the conventional predictions of isotropic weight variance under stationarity, based on uncorrelated and curvature-proportional noise, are recovered. Anti-correlations have negligible effect here. However, for relatively flat directions, the weight variance is significantly reduced, leading to a considerable decrease in loss fluctuations compared to the constant weight variance assumption. Furthermore, we present a numerical experiment where training with these anti-correlations enhances test performance, suggesting that the inherent noise structure induced by epoch-based training may play a role in finding flatter minima that generalize better.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes