LG · ML · Sep 29, 2018

A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

arXiv:1810.00122v2 · 23 citations
Originality: Incremental advance
AI Analysis

This paper offers machine learning practitioners incremental theoretical insight into why batch normalization is robust.

The paper addresses the lack of quantitative analysis of how batch normalization affects the convergence and stability of gradient descent. Working on ordinary least squares, it shows that BNGD converges for arbitrary learning rates on the weights, with linear convergence under mild conditions, and it quantifies two sources of acceleration over GD. Numerical experiments confirm the results.

Despite the empirical success of batch normalization (BN) and recent theoretical progress, a quantitative analysis of its effect on the convergence and stability of gradient descent is generally lacking. In this paper, we provide such an analysis on the simple problem of ordinary least squares (OLS). Since the precise dynamical properties of gradient descent (GD) are completely known for the OLS problem, this setting allows us to isolate and compare the additional effects of BN. More precisely, we show that, unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two different sources of acceleration of BNGD over GD: one due to over-parameterization, which improves the effective condition number, and another due to having a large range of learning rates that give rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its robustness. These findings are confirmed quantitatively by numerical experiments, which further show that many of the uncovered properties of BNGD in OLS are also observed qualitatively in more complex supervised learning problems.
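The contrast between GD and BNGD described in the abstract can be illustrated with a small sketch. It is not the authors' code, and the setup is an assumption made for illustration: a noiseless OLS problem with a diagonal matrix H = E[x x^T], true weights `U`, and a BN model whose prediction is a * w^T x / sqrt(w^T H w). The names `H`, `U`, `run_gd`, and `run_bngd` are hypothetical. The learning rate is chosen above the GD stability threshold 2 / lambda_max(H), so plain GD should diverge while BNGD should still reach a small loss.

```python
import math

# Hypothetical noiseless OLS problem (an assumption for illustration).
# H is diagonal, so it is stored as its list of eigenvalues.
H = [1.0, 2.0, 5.0, 10.0]           # lambda_max = 10, so GD needs lr < 0.2
U = [1.0, -1.0, 0.5, 2.0]           # true weights u
G = [h * u for h, u in zip(H, U)]   # g = E[x y] = H u


def ols_loss(w):
    """J(w) = 1/2 (w - u)^T H (w - u); the minimum value is 0."""
    return 0.5 * sum(h * (wi - ui) * (wi - ui) for h, wi, ui in zip(H, w, U))


def run_gd(lr, steps, w0):
    """Plain gradient descent on the OLS loss; returns the final loss."""
    w = list(w0)
    for _ in range(steps):
        # The gradient of J is H (w - u).
        w = [wi - lr * h * (wi - ui) for h, wi, ui in zip(H, w, U)]
    return ols_loss(w)


def run_bngd(lr, steps, w0, lr_a=1.0):
    """Gradient descent on the BN-reparameterized loss J(a, w).

    The effective weights are a * w / sigma with sigma = sqrt(w^T H w).
    lr is the step size for w and lr_a the step size for the scale a.
    """
    w, a = list(w0), 0.0
    for _ in range(steps):
        sigma = math.sqrt(sum(h * wi * wi for h, wi in zip(H, w)))
        c = sum(wi * gi for wi, gi in zip(w, G)) / sigma   # w^T g / sigma
        # dJ/dw = -(a / sigma) * (g - (w^T g / sigma^2) H w)
        grad_w = [-(a / sigma) * (gi - (c / sigma) * h * wi)
                  for h, gi, wi in zip(H, G, w)]
        grad_a = a - c                                     # dJ/da
        w = [wi - lr * gw for wi, gw in zip(w, grad_w)]
        a = a - lr_a * grad_a
    sigma = math.sqrt(sum(h * wi * wi for h, wi in zip(H, w)))
    return ols_loss([a * wi / sigma for wi in w])


if __name__ == "__main__":
    w0 = [1.0, 1.0, 1.0, 1.0]
    lr = 1.0  # five times the GD stability limit for this H
    print("GD loss:  ", run_gd(lr, 2000, w0))
    print("BNGD loss:", run_bngd(lr, 2000, w0))
```

The gradient with respect to w is orthogonal to w, so the norm of w never shrinks under these updates. A step that is too large therefore inflates the norm of w and reduces the effective step size on the direction of w, which is one informal way to read the claim that BNGD tolerates arbitrary learning rates.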

