LG | Feb 26, 2021

Noisy Truncated SGD: Optimization and Generalization

Yingxue Zhou, Xinyan Li, Arindam Banerjee

arXiv:2103.00075v24.44 citations

Originality: Incremental advance

AI Analysis

This work addresses optimization efficiency and generalization in deep learning, offering a method that sparsifies gradients while maintaining performance. It is incremental in that it builds on existing SGD variants.

The paper studies the optimization of over-parameterized deep learning models. It analyzes Truncated SGD (T-SGD), which sets small gradient components to zero, and proposes Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. T-SGD matches the order of vanilla SGD's convergence rate on non-convex problems, and NT-SGD keeps that rate while provably escaping saddle points and achieving a better generalization error bound than T-SGD.

Recent empirical work on stochastic gradient descent (SGD) applied to over-parameterized deep learning has shown that most gradient components over epochs are quite small. Inspired by such observations, we rigorously study properties of Truncated SGD (T-SGD), which truncates the majority of small gradient components to zero. Considering non-convex optimization problems, we show that the convergence rate of T-SGD matches the order of vanilla SGD. We also establish a generalization error bound for T-SGD. Further, we propose Noisy Truncated SGD (NT-SGD), which adds Gaussian noise to the truncated gradients. We prove that NT-SGD has the same convergence rate as T-SGD for non-convex optimization problems. We demonstrate that, with the help of noise, NT-SGD can provably escape from saddle points and requires less noise than previous related work. We also prove that NT-SGD achieves a better generalization error bound than T-SGD because of the noise. Our generalization analysis is based on uniform stability, and we show that additional noise in the gradient update can boost the stability. Our experiments on a variety of benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100) with various networks (VGG and ResNet) validate the theoretical properties of NT-SGD, i.e., NT-SGD matches the speed and accuracy of vanilla SGD while effectively working with sparse gradients, and can successfully escape poor local minima.
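
The update described in the abstract (truncate small gradient components, then add Gaussian noise) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the truncation rule or where the noise is applied, so the top-k magnitude rule, the isotropic noise on every coordinate, and the names `truncate_gradient`, `nt_sgd_step`, `keep_fraction`, and `noise_std` are all assumptions made for illustration.

```python
import numpy as np


def truncate_gradient(grad, keep_fraction):
    """Zero all but the largest-magnitude components.

    Assumption: a top-k magnitude rule; the abstract only says that the
    majority of small components are truncated to zero.
    """
    flat = grad.ravel()
    k = min(flat.size, max(1, int(round(keep_fraction * flat.size))))
    # Indices of the k largest |g_i|; argpartition avoids a full sort.
    keep = np.argpartition(np.abs(flat), flat.size - k)[flat.size - k:]
    truncated = np.zeros_like(flat)
    truncated[keep] = flat[keep]
    return truncated.reshape(grad.shape)


def nt_sgd_step(params, grad, lr, keep_fraction, noise_std, rng):
    """One hypothetical NT-SGD step: truncated gradient plus Gaussian noise.

    Assumption: zero-mean isotropic noise added to every coordinate.
    """
    truncated = truncate_gradient(grad, keep_fraction)
    noise = rng.normal(0.0, noise_std, size=grad.shape)
    return params - lr * (truncated + noise)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = np.zeros(5)
    grad = np.array([0.01, -2.0, 0.03, 1.5, -0.02])
    # Keeping 40% of 5 components retains the two largest: -2.0 and 1.5.
    print(truncate_gradient(grad, keep_fraction=0.4))
    print(nt_sgd_step(params, grad, lr=0.1, keep_fraction=0.4,
                      noise_std=0.01, rng=rng))
```

Setting `noise_std=0.0` reduces the step to plain T-SGD under the same top-k assumption, which makes it easy to compare the two variants side by side.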
