An overview of gradient descent optimization algorithms
This overview is aimed at practitioners and researchers in machine learning who want to better understand and apply existing optimization methods.
The paper addresses the widespread use of gradient descent optimization algorithms as black-box optimizers by giving intuitive explanations of their behaviour, strengths, and weaknesses. It does not present new experimental results or concrete numbers.
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
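To make the phrase "variants of gradient descent" concrete, the following is a minimal sketch, not code taken from the article. It assumes the usual distinction between batch, stochastic, and mini-batch gradient descent, which differ only in how many examples are used for each parameter update. The toy data set, the one-parameter model y = w * x, the learning rate, and the function names `gradient` and `gradient_descent` are all illustrative assumptions.

```python
import random


def gradient(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w * x - y) ** 2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(data, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y = w * x by gradient descent.

    batch_size == len(data)      -> batch gradient descent
    batch_size == 1              -> stochastic gradient descent (SGD)
    1 < batch_size < len(data)   -> mini-batch gradient descent
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # step against the gradient
    return w


# Hypothetical noise-free toy data generated from y = 3 * x.
data = [(x / 10, 3.0 * x / 10) for x in range(1, 11)]

w_batch = gradient_descent(data, batch_size=len(data))
w_sgd = gradient_descent(data, batch_size=1)
w_mini = gradient_descent(data, batch_size=4)
print(w_batch, w_sgd, w_mini)
```

On this noise-free toy problem all three variants approach w = 3. The differences between them (cost per update, variance of the updates) only become visible on larger or noisier problems, which this sketch does not attempt to show.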