ML LG | Nov 20, 2015

Variance Reduction in SGD by Distributed Importance Sampling

Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio

arXiv:1511.06481v7 | 31.3224 citations

Originality: Incremental advance

AI Analysis

This work addresses efficient distributed deep learning: it aims to reduce gradient variance, and thereby stabilize training, even when synchronization across machines is costly. It is rated incremental because it builds on existing importance sampling methods.

The paper tackles the problem of high gradient variance in distributed stochastic gradient descent (SGD) by proposing a framework where workers search for informative examples in parallel and a single worker updates the model using importance sampling, resulting in reduced gradient variance as shown experimentally.

Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers searches for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. This leads the model to update using an unbiased estimate of the gradient which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.
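The abstract's central claim, that sampling examples in proportion to the L2-norm of their gradients gives an unbiased gradient estimate with lower variance, can be illustrated with a small single-process sketch. This is not the authors' distributed implementation: the function names, the synthetic per-example gradients, and the use of exact (rather than stale, worker-computed) gradient norms are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # guard against an all-zero gradient set (illustrative assumption)


def sampling_probs(per_example_grads):
    """Proposal distribution proportional to each example's gradient L2-norm."""
    norms = np.linalg.norm(per_example_grads, axis=1)
    scores = norms + EPS
    return scores / scores.sum()


def importance_sampled_mean(per_example_grads, batch_size, rng):
    """Estimate the mean gradient from a minibatch drawn with the proposal above."""
    n = per_example_grads.shape[0]
    probs = sampling_probs(per_example_grads)
    idx = rng.choice(n, size=batch_size, p=probs)
    # Reweighting by 1 / (n * p_i) keeps the estimator unbiased for the mean.
    weights = 1.0 / (n * probs[idx])
    return (weights[:, None] * per_example_grads[idx]).mean(axis=0)


def uniform_mean(per_example_grads, batch_size, rng):
    """Baseline: ordinary minibatch estimate with uniform sampling."""
    n = per_example_grads.shape[0]
    idx = rng.integers(0, n, size=batch_size)
    return per_example_grads[idx].mean(axis=0)


def total_variance(estimator, per_example_grads, batch_size, trials, rng):
    """Return (trace of the estimator's covariance, mean estimate) over many trials."""
    estimates = np.stack(
        [estimator(per_example_grads, batch_size, rng) for _ in range(trials)]
    )
    return estimates.var(axis=0).sum(), estimates.mean(axis=0)


def make_synthetic_grads(n, dim, rng):
    """Hypothetical per-example gradients whose norms vary widely across examples."""
    scales = rng.lognormal(mean=0.0, sigma=1.0, size=(n, 1))
    return rng.normal(size=(n, dim)) * scales


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = make_synthetic_grads(n=1000, dim=5, rng=rng)
    var_is, _ = total_variance(importance_sampled_mean, grads, 32, 2000, rng)
    var_uni, _ = total_variance(uniform_mean, grads, 32, 2000, rng)
    print("total variance, uniform sampling:    ", var_uni)
    print("total variance, importance sampling: ", var_is)
```

The gap between the two variances grows as the gradient norms become more uneven across examples; when all norms are equal, the proposal reduces to uniform sampling and the two estimators coincide.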
