ML · LG · Jun 4, 2024

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

arXiv:2406.02157v1 · 2 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency trade-offs in neural network training for researchers and practitioners, offering incremental improvements in optimization protocols.

The paper studies how to choose the batch size in one-pass SGD when training two-layer neural networks on multi-index target functions. It shows that large batches, up to a dimension-dependent threshold, minimize training time without increasing sample complexity, and it introduces Correlation loss SGD to overcome the limitation that arises beyond that threshold. The findings are supported by theoretical analysis and numerical experiments.

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as measured by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned (Ben Arous et al., 2021) and $d$ is the input dimension. However, batch sizes $n_b \gg d^{\frac{\ell}{2}}$ are detrimental to the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.
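To make the setting in the abstract concrete, here is a minimal sketch of one-pass mini-batch SGD with a correlation-type loss. It is not the authors' implementation, and several choices are assumptions made for illustration: a single-index target $y = \mathrm{He}_\ell(\langle w^\star, x \rangle)$ with a single student neuron using the same Hermite activation, a correlation loss taken to be $-y\,f(x)$ (the squared loss with its $f(x)^2$ auto-correlation term dropped), and a spherical normalization after every step. The names `hermite`, `correlation_sgd`, and `max_useful_batch` are hypothetical.

```python
import math
import random


def hermite(k, z):
    """Probabilists' Hermite polynomial He_k(z) via the three-term recurrence."""
    if k == 0:
        return 1.0
    prev, cur = 1.0, z
    for j in range(1, k):
        prev, cur = cur, z * cur - j * prev
    return cur


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def max_useful_batch(d, ell):
    """Batch-size scale d**(ell/2) quoted in the abstract (constants ignored)."""
    return d ** (ell / 2)


def correlation_sgd(d, ell, n_b, steps, lr, seed=0):
    """One-pass SGD on the assumed correlation loss -y * He_ell(<w, x>).

    Every step draws n_b fresh Gaussian samples, so the total number of
    samples used is steps * n_b. Returns the overlap <w, w_star> per step.
    Requires ell >= 1.
    """
    rng = random.Random(seed)
    w_star = normalize([rng.gauss(0, 1) for _ in range(d)])  # hidden direction
    w = normalize([rng.gauss(0, 1) for _ in range(d)])       # student weights
    overlaps = []
    for _ in range(steps):
        step = [0.0] * d
        for _ in range(n_b):
            x = [rng.gauss(0, 1) for _ in range(d)]
            y = hermite(ell, dot(w_star, x))
            # d/dz He_ell(z) = ell * He_{ell-1}(z); the descent direction of
            # the loss -y * He_ell(<w, x>) with respect to w is +y * He'_ell * x.
            coef = y * ell * hermite(ell - 1, dot(w, x))
            for i in range(d):
                step[i] += coef * x[i] / n_b
        # Gradient step followed by projection back onto the unit sphere.
        w = normalize([w[i] + lr * step[i] for i in range(d)])
        overlaps.append(dot(w, w_star))
    return overlaps


if __name__ == "__main__":
    d, ell = 20, 1
    trace = correlation_sgd(d=d, ell=ell, n_b=100, steps=60, lr=0.2)
    print("reference batch scale:", max_useful_batch(d, ell))
    print("final overlap:", trace[-1])
```

Running the function with several values of `n_b` while holding `steps * n_b` fixed is one way to explore the time versus sample-complexity tradeoff the abstract describes. The learning rate and step counts here are illustrative and not tuned to the scalings analyzed in the paper.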

Code Implementations: 1 repo