Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix
This work aims to improve optimization efficiency in distributed deep learning. It appears incremental: it builds on existing second-order methods and reports only preliminary results.
The paper aims to accelerate stochastic gradient descent for distributed deep learning. It introduces a method to compute a rank $m$ approximation of the inverse Hessian matrix, using the differences in gradients and parameters across workers to implement a distributed approximation of the Newton-Raphson method. Preliminary results highlight the advantages and challenges of second-order methods for large stochastic optimization problems.
We introduce a novel method to compute a rank $m$ approximation of the inverse of the Hessian matrix in the distributed regime. By leveraging the differences in gradients and parameters of multiple Workers, we are able to efficiently implement a distributed approximation of the Newton-Raphson method. We also present preliminary results which underline advantages and challenges of second-order methods for large stochastic optimization problems. In particular, our work suggests that novel strategies for combining gradients provide further information on the loss surface.
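To make the idea concrete, the following is a minimal sketch of one way such a rank $m$ approximation could be built from worker differences. It is not taken from the paper: the secant-style construction $H^{-1} \approx S (S^\top Y)^{+} S^\top$, the choice of worker 0 as the reference point, and the function names `inverse_hessian_factors` and `newton_like_step` are all assumptions made for illustration. Here the columns of $S$ are parameter differences and the columns of $Y$ are the matching gradient differences, so that $H S \approx Y$ for a locally quadratic loss.

```python
import numpy as np

def inverse_hessian_factors(thetas, grads):
    """Return (S, M) such that H^{-1} is approximated by S @ M @ S.T.

    thetas, grads: arrays of shape (m + 1, d), one row per worker.
    Worker 0 is used as the reference point (an illustrative assumption).
    """
    thetas = np.asarray(thetas, dtype=float)
    grads = np.asarray(grads, dtype=float)
    S = (thetas[1:] - thetas[0]).T  # (d, m) parameter differences
    Y = (grads[1:] - grads[0]).T    # (d, m) gradient differences
    # Secant relation H S ~ Y; the pseudo-inverse keeps the small
    # m x m system solvable even if some differences are collinear.
    M = np.linalg.pinv(S.T @ Y)
    return S, M

def newton_like_step(theta, grad, S, M):
    """Apply theta - (S M S^T) grad without forming the d x d matrix."""
    return theta - S @ (M @ (S.T @ grad))

# Toy check on a quadratic loss 0.5 * x^T H x - b^T x (hypothetical data).
rng = np.random.default_rng(0)
d, workers = 3, 4
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)            # symmetric positive definite
b = rng.normal(size=d)
thetas = rng.normal(size=(workers, d))
grads = thetas @ H - b                 # row i is H @ thetas[i] - b

S, M = inverse_hessian_factors(thetas, grads)
theta_new = newton_like_step(thetas[0], grads[0], S, M)
theta_star = np.linalg.solve(H, b)     # exact minimizer of the quadratic
```

In this toy setting the number of differences equals the dimension, so the approximation recovers the exact inverse Hessian and a single step lands on the minimizer. With $m \ll d$, as in deep learning, the same factors only capture curvature in the subspace spanned by the worker differences, and stochastic gradients make the secant relation approximate rather than exact.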