Generalization Bounds of Stochastic Gradient Descent in Homogeneous Neural Networks
This work provides a theoretical basis for using more aggressive learning rates in SGD for practitioners working with common neural network architectures, potentially improving optimization efficiency.
The paper addresses the generalization analysis of stochastic gradient descent (SGD) in non-convex training, specifically for homogeneous neural networks. It demonstrates that for these networks, a slower stepsize decay of order $\Omega(1/\sqrt{t})$ is sufficient for generalization, in contrast to the typical $\mathcal{O}(1/t)$ requirement.
Algorithmic stability is among the most potent techniques in generalization analysis. However, its derivation usually requires a stepsize $η_t = \mathcal{O}(1/t)$ under non-convex training regimes, where $t$ denotes iterations. This rigid decay of the stepsize potentially impedes optimization and may not align with practical scenarios. In this paper, we derive the generalization bounds under the homogeneous neural network regimes, proving that this regime enables slower stepsize decay of order $Ω(1/\sqrt{t})$ under mild assumptions. We further extend the theoretical results from several aspects, e.g., non-Lipschitz regimes. This finding is broadly applicable, as homogeneous neural networks encompass fully-connected and convolutional neural networks with ReLU and LeakyReLU activations.