LG OC Sep 15, 2016

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

arXiv:1609.04836v255.23501 citations, Has Code

Originality: Incremental advance

AI Analysis

The paper addresses a key practical problem for deep learning practitioners by explaining the generalization gap in large-batch training, though it is incremental in that it builds on known observations about sharp versus flat minima.

The paper investigates why large-batch training in deep learning generalizes worse than small-batch training. It reports numerical evidence that large-batch methods tend to converge to sharp minima, which are associated with poorer generalization, while small-batch methods converge to flat minima, an effect the authors attribute to noise in the gradient estimate.

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
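The abstract's point about "inherent noise in the gradient estimation" can be illustrated with a minimal sketch. It is not the authors' experiment: it assumes a toy one-parameter least-squares problem, and the helper names `batch_gradient` and `gradient_noise` are hypothetical. It only shows that a gradient computed from a small sample deviates more from the full-data gradient than one computed from a large sample.

```python
import random

def batch_gradient(w, xs, ys, idx):
    """Gradient of 0.5 * mean((w*x - y)^2) w.r.t. scalar w over the samples in idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def gradient_noise(w, xs, ys, batch_size, n_trials, rng):
    """Mean squared deviation of a mini-batch gradient from the full-data gradient."""
    n = len(xs)
    full = batch_gradient(w, xs, ys, range(n))
    total = 0.0
    for _ in range(n_trials):
        # Sample a mini-batch without replacement.
        idx = rng.sample(range(n), batch_size)
        total += (batch_gradient(w, xs, ys, idx) - full) ** 2
    return total / n_trials

# Toy data (an assumption for illustration): y = 3x + Gaussian noise.
rng = random.Random(0)
n = 2048
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

noise_small = gradient_noise(0.0, xs, ys, 32, 200, rng)    # small-batch regime
noise_large = gradient_noise(0.0, xs, ys, 1024, 200, rng)  # large-batch regime
print(noise_small, noise_large)
```

On this toy problem the small-batch estimate is much noisier than the large-batch one. Whether that noise is what steers training toward flat minimizers is the paper's empirical claim, which this sketch does not test.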
