ML LGJun 15, 2017

Stochastic Training of Neural Networks via Successive Convex Approximations

arXiv:1706.04769v11.81 citations

Originality Highly original

AI Analysis

This addresses the challenge of efficient and scalable neural network training for machine learning practitioners, offering a novel optimization approach with potential for parallelization.

The paper tackles the problem of training neural networks by proposing a new family of algorithms based on successive convex approximations, which iteratively replace the non-convex optimization with simpler convex problems using first-order information. The results show that the algorithm outperforms state-of-the-art techniques, providing faster convergence to a better minimum on medium-sized benchmarks and a large-scale dataset.

This paper proposes a new family of algorithms for training neural networks (NNs). These are based on recent developments in the field of non-convex optimization, going under the general name of successive convex approximation (SCA) techniques. The basic idea is to iteratively replace the original (non-convex, highly dimensional) learning problem with a sequence of (strongly convex) approximations, which are both accurate and simple to optimize. Differently from similar ideas (e.g., quasi-Newton algorithms), the approximations can be constructed using only first-order information of the neural network function, in a stochastic fashion, while exploiting the overall structure of the learning problem for a faster convergence. We discuss several use cases, based on different choices for the loss function (e.g., squared loss and cross-entropy loss), and for the regularization of the NN's weights. We experiment on several medium-sized benchmark problems, and on a large-scale dataset involving simulated physical data. The results show how the algorithm outperforms state-of-the-art techniques, providing faster convergence to a better minimum. Additionally, we show how the algorithm can be easily parallelized over multiple computational units without hindering its performance. In particular, each computational unit can optimize a tailored surrogate function defined on a randomly assigned subset of the input variables, whose dimension can be selected depending entirely on the available computational power.

View on arXiv PDF

Similar