LG ML | May 21, 2018

SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Chunpeng Wu, Yiran Chen, Hai Li

arXiv:1805.07898v316.555 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a fundamental issue in deep learning training that is relevant to both researchers and practitioners, and it offers an incremental improvement over existing noise-injection methods.

The paper tackles sharp minima in deep neural networks, which are associated with poor generalization, especially under large-batch SGD. It proposes SmoothOut, a framework that smooths out sharp minima by injecting noise into multiple copies of the network and averaging them. The authors report improved generalization across a variety of experiments.

In Deep Learning, Stochastic Gradient Descent (SGD) is usually selected as the training method because of its efficiency. Recently, however, a problem with SGD has gained research interest: sharp minima in Deep Neural Networks (DNNs) generalize poorly, and large-batch SGD in particular tends to converge to sharp minima. Whether escaping sharp minima can improve generalization remains an open question. To answer this question, we propose the SmoothOut framework to smooth out sharp minima in DNNs and thereby improve generalization. In a nutshell, SmoothOut perturbs multiple copies of the DNN by noise injection and averages these copies. Injecting noise into SGD is widely used in the literature, but SmoothOut differs in several ways: (1) a de-noising process is applied before the parameter update; (2) the noise strength is adapted to the filter norm; (3) it offers an alternative interpretation of the advantage of noise injection, from the perspective of sharpness and generalization; (4) it uses uniform noise instead of Gaussian noise. We prove that SmoothOut can eliminate sharp minima. Because training multiple DNN copies is inefficient, we further propose an unbiased stochastic SmoothOut, which introduces only the overhead of noise injection and de-noising per batch. An adaptive variant of SmoothOut, AdaSmoothOut, is also proposed to improve generalization. In a variety of experiments, SmoothOut and AdaSmoothOut consistently improve generalization in both small-batch and large-batch training on top of state-of-the-art solutions.
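The per-batch procedure described in the abstract (inject uniform noise, compute the gradient, de-noise, then update) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smoothout_step`, `quadratic_grad`, and `train`, the toy quadratic loss, the noise radius `a`, and all hyperparameter values are assumptions chosen for illustration, and the filter-norm scaling of AdaSmoothOut is not modeled here.

```python
import random


def smoothout_step(weights, grad_fn, lr, a, rng):
    """One stochastic SmoothOut-style update (illustrative sketch only)."""
    # 1. Noise injection: one uniform sample in [-a, a] per weight
    #    (the radius `a` is a hypothetical hyperparameter).
    noise = [rng.uniform(-a, a) for _ in weights]
    perturbed = [w + n for w, n in zip(weights, noise)]
    # 2. The gradient is evaluated at the perturbed weights.
    grads = grad_fn(perturbed)
    # 3. De-noising: remove the same noise before the parameter update.
    restored = [p - n for p, n in zip(perturbed, noise)]
    # 4. Plain SGD update applied to the restored weights.
    return [w - lr * g for w, g in zip(restored, grads)]


def quadratic_grad(weights, target=(1.0, -2.0)):
    # Gradient of the toy loss sum((w - t)^2); stands in for backprop.
    return [2.0 * (w - t) for w, t in zip(weights, target)]


def train(steps=200, lr=0.1, a=0.1, seed=0):
    rng = random.Random(seed)
    w = [4.0, 1.0]
    for _ in range(steps):
        w = smoothout_step(w, quadratic_grad, lr, a, rng)
    return w
```

Calling `train()` drives the weights toward the toy target `(1.0, -2.0)`, up to a small jitter caused by the injected noise. Setting `a=0` reduces the step to ordinary SGD, which makes the sketch easy to compare against a baseline.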
