On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
This paper addresses the problem of understanding optimization dynamics in deep learning. It offers researchers theoretical insight into batch-size effects, though the contribution is incremental because it builds on existing SDE frameworks.
The paper investigates the relationship between large-batch training and convergence to sharp minima in stochastic gradient descent (SGD). It finds that SGD asymptotically converges to flatter minima regardless of batch size, but that the convergence rate depends on batch size. These findings are validated empirically across datasets and models.
We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate mini-batch SGD and momentum SGD by stochastic differential equations (SDEs). We exploit the continuous formulation of the SDEs and the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and on the relationship between large batch sizes and sharp minima. In particular, we find that the solution of the stochastic process tends to converge to flatter minima regardless of the batch size in the asymptotic regime. However, the convergence rate is rigorously proven to depend on the batch size. These results are validated empirically with various datasets and models.
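
To make the SDE view concrete, here is a minimal sketch; it is not the paper's own construction. It assumes a commonly used approximation in which one SGD step with learning rate `lr` corresponds to a time increment `dt = lr` and the gradient noise enters with scale `sqrt(lr / batch_size)`. It also assumes a hypothetical one-dimensional quadratic loss with a fixed `curvature`, which makes the process an Ornstein-Uhlenbeck process with a known stationary variance. All function names and parameters are illustrative.

```python
import math
import random


def simulate_sgd_sde(curvature, lr, batch_size, noise_std=1.0,
                     n_steps=500, n_paths=2000, seed=0):
    """Euler-Maruyama simulation of the assumed SDE
        d(theta) = -curvature * theta dt + sqrt(lr / batch_size) * noise_std dW
    for the toy loss L(theta) = 0.5 * curvature * theta**2.
    Returns the final value of theta for each simulated path."""
    rng = random.Random(seed)
    dt = lr  # assumption: one SGD step corresponds to time lr
    diffusion = math.sqrt(lr / batch_size) * noise_std
    sqrt_dt = math.sqrt(dt)
    finals = []
    for _path in range(n_paths):
        theta = 0.0  # every path starts at the minimum
        for _step in range(n_steps):
            dw = rng.gauss(0.0, sqrt_dt)  # Brownian increment over dt
            theta += -curvature * theta * dt + diffusion * dw
        finals.append(theta)
    return finals


def empirical_variance(samples):
    """Population variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def stationary_variance(curvature, lr, batch_size, noise_std=1.0):
    """Stationary variance of the Ornstein-Uhlenbeck process above."""
    return (lr / batch_size) * noise_std ** 2 / (2.0 * curvature)


if __name__ == "__main__":
    for batch_size in (8, 64):
        samples = simulate_sgd_sde(curvature=1.0, lr=0.01, batch_size=batch_size)
        print(batch_size,
              empirical_variance(samples),
              stationary_variance(1.0, 0.01, batch_size))
```

Under these assumptions the long-run variance around the minimum is `lr * noise_std**2 / (2 * batch_size * curvature)`, so in this toy model a larger batch narrows the spread and a flatter minimum (smaller `curvature`) widens it. Flatter minima also relax more slowly, so `n_steps` has to grow as `curvature` shrinks before the empirical variance approaches the stationary value.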