The Effect of Network Width on the Performance of Large-batch Training
This work addresses efficient distributed training, a practical concern for machine learning practitioners, and offers insight into how network design can mitigate the convergence problems of large-batch training. The contribution is incremental in that it builds on existing large-batch training research.
The paper studies large-batch training of neural networks, which reduces communication overheads but can harm convergence and generalization. Its theoretical results and its experiments on residual and fully-connected networks suggest that wider networks can be trained with larger batches without slowing convergence, unlike deeper networks with the same parameter count.
Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, attributed to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, large batches can affect the convergence properties and generalization performance of SGD. In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that--for a fixed number of parameters--wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
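To make the "fixed number of parameters" comparison and the communication argument concrete, the following is a minimal sketch and not the paper's experimental setup. It counts the parameters of a plain fully-connected network, finds the hidden width that lets a deeper network fit the same budget as a wider one, and counts gradient updates per epoch for a given batch size. The function names, the layer sizes (784 inputs, 10 outputs, 2 versus 8 hidden layers), and the dataset size of 50,000 examples are illustrative assumptions, and one gradient update is assumed to cost one communication round.

```python
import math


def mlp_param_count(input_dim, hidden_width, num_hidden_layers, output_dim):
    """Count weights and biases of a fully-connected network with equal-width hidden layers."""
    dims = [input_dim] + [hidden_width] * num_hidden_layers + [output_dim]
    # Each layer contributes a d_in x d_out weight matrix plus d_out biases.
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))


def width_for_budget(budget, input_dim, num_hidden_layers, output_dim):
    """Largest hidden width whose parameter count does not exceed `budget`."""
    if mlp_param_count(input_dim, 1, num_hidden_layers, output_dim) > budget:
        raise ValueError("budget too small for this depth")
    width = 1
    while mlp_param_count(input_dim, width + 1, num_hidden_layers, output_dim) <= budget:
        width += 1
    return width


def updates_per_epoch(num_examples, batch_size):
    """Gradient updates per epoch (assumed here to equal communication rounds)."""
    return math.ceil(num_examples / batch_size)


# Hypothetical comparison: a wide, shallow network versus a deep, narrow one.
wide_params = mlp_param_count(784, 512, 2, 10)
deep_width = width_for_budget(wide_params, 784, 8, 10)
deep_params = mlp_param_count(784, deep_width, 8, 10)
print("wide: 2 hidden layers, width 512, parameters:", wide_params)
print("deep: 8 hidden layers, width", deep_width, "parameters:", deep_params)

# Larger batches mean fewer updates, hence fewer communication rounds, per epoch.
for batch_size in (128, 1024, 4096):
    print("batch size", batch_size, "updates per epoch:", updates_per_epoch(50000, batch_size))
```

The sketch only sets up the comparison. Whether the wider network actually tolerates the larger batch without a convergence slow-down is the question the paper's theory and experiments address.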