A geometric interpretation of stochastic gradient descent using diffusion metrics
This work offers a theoretical interpretation of SGD aimed at researchers in optimization and machine learning. It appears incremental, however: it builds on existing geometric frameworks and does not demonstrate practical improvements.
The authors address the geometric significance of stochastic gradient descent (SGD) by developing a deterministic model in which SGD trajectories are geodesics of metrics derived from the diffusion matrix, and they draw an analogy between this model and General Relativity.
Stochastic gradient descent (SGD) is a key ingredient in the training of deep neural networks, and yet its geometrical significance appears elusive. We study a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from the diffusion matrix. These metrics encode information about the highly non-isotropic gradient noise in SGD. We establish a parallel with General Relativity models, where the role of the electromagnetic field is played by the gradient of the loss function. We compute an example of a two-layer network.
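To make the notion of non-isotropic gradient noise concrete, the following is a minimal sketch, not the authors' computation. It assumes the common definition of the diffusion matrix as the covariance of per-sample gradients, D(w) = (1/N) Σ_i (∇ℓ_i − ∇L)(∇ℓ_i − ∇L)^T, and uses a hypothetical scalar two-layer linear network f(x) = w2·w1·x with squared loss. The function names, data, and weights are illustrative assumptions.

```python
import numpy as np

def per_sample_grads(w, x, y):
    """Per-sample gradients of 0.5 * (w2*w1*x_i - y_i)^2 w.r.t. (w1, w2)."""
    w1, w2 = w
    r = w2 * w1 * x - y                 # residuals, shape (N,)
    g1 = r * w2 * x                     # d loss_i / d w1
    g2 = r * w1 * x                     # d loss_i / d w2
    return np.stack([g1, g2], axis=1)   # shape (N, 2)

def diffusion_matrix(w, x, y):
    """Covariance of per-sample gradients (assumed definition of D)."""
    g = per_sample_grads(w, x, y)
    centered = g - g.mean(axis=0)       # subtract the full-batch gradient
    return centered.T @ centered / len(x)

# Synthetic data and weights, chosen only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + 0.1 * rng.normal(size=200)
w = np.array([0.7, -0.4])

D = diffusion_matrix(w, x, y)
eigvals = np.linalg.eigvalsh(D)         # eigenvalues in ascending order
print("diffusion matrix:\n", D)
print("eigenvalues:", eigvals)
```

In this toy model every per-sample gradient is a scalar multiple of the vector (w2, w1), so D has rank one: the noise is concentrated along a single direction in weight space and vanishes in the orthogonal one. This is an extreme case of the anisotropy that the diffusion-based metrics are meant to encode.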