LG, AI, CV, ML
May 9, 2019

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

arXiv:1905.03776v1
61 citations
Originality: Incremental advance
AI Analysis

This paper gives deep learning practitioners empirical guidance on hyper-parameter tuning, but the advance is incremental because it builds on the existing understanding of SGD and network width.

The study investigated how over-parameterization affects stochastic gradient descent (SGD) and generalization. It found that the optimal SGD hyper-parameters are determined by a "normalized noise scale" whose optimal value, in the absence of batch normalization, is proportional to network width, and that wider networks achieve higher test accuracy across MLPs, ConvNets, and ResNets.

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
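
The last claim in the abstract can be illustrated with a short sketch. It assumes the common approximation from earlier work on SGD noise, g ≈ εN/B (learning rate ε, training set size N, batch size B), and it ignores the initialization-dependent normalization that the abstract mentions but does not define. The function name and the constant noise_per_unit_width are hypothetical, chosen only for illustration. Under these assumptions, if the optimal noise scale grows linearly with width while the learning rate is capped, the largest batch size that can still reach the optimum shrinks in proportion to 1/width.

```python
def largest_batch_size(width, max_learning_rate, train_set_size, noise_per_unit_width):
    """Largest batch size that can still reach the optimal noise scale.

    Assumptions (not taken from the abstract):
      * noise scale g ~= learning_rate * train_set_size / batch_size
      * optimal noise scale g_opt = noise_per_unit_width * width
        (a hypothetical proportionality constant)
    Solving g_opt = max_learning_rate * train_set_size / B for B gives the
    largest usable batch size, since the learning rate cannot exceed
    max_learning_rate.
    """
    optimal_noise_scale = noise_per_unit_width * width
    return max_learning_rate * train_set_size / optimal_noise_scale


# Illustrative values only; none of these numbers come from the paper.
for width in (64, 128, 256, 512):
    batch = largest_batch_size(
        width,
        max_learning_rate=0.1,
        train_set_size=50_000,
        noise_per_unit_width=0.5,
    )
    print(f"width={width:4d}  largest batch size ~ {batch:.1f}")
```

Doubling the width in this sketch halves the largest usable batch size, which is the qualitative trend the abstract describes.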
