OC · LG · Dec 4, 2023

Unlocking optimal batch size schedules using continuous-time control and perturbation theory

arXiv:2312.01898v1 · 4 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses hyperparameter tuning for SGD, a foundational optimization method in machine learning, but appears incremental as it builds on prior studies of variable batch sizes.

The authors address the problem of choosing batch size schedules for Stochastic Gradient Descent (SGD). They derive schedules that are optimal up to an error quadratic in the learning rate, then apply the results to linear regression.

Stochastic Gradient Descent (SGD) and its variants are almost universally used to train neural networks and to fit a variety of other parametric models. An important hyperparameter in this context is the batch size, which determines how many samples are processed before an update of the parameters occurs. Previous studies have demonstrated the benefits of using variable batch sizes. In this work, we will theoretically derive optimal batch size schedules for SGD and similar algorithms, up to an error that is quadratic in the learning rate. To achieve this, we approximate the discrete process of parameter updates using a family of stochastic differential equations indexed by the learning rate. To better handle the state-dependent diffusion coefficient, we further expand the solution of this family into a series with respect to the learning rate. Using this setup, we derive a continuous-time optimal batch size schedule for a large family of diffusion coefficients and then apply the results in the setting of linear regression.
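To make the setting concrete, the following is a minimal sketch of minibatch SGD on a synthetic linear regression problem where the batch size changes from step to step. It is not the authors' method: the function names, the synthetic data, and the linearly growing schedule are all assumptions chosen for illustration, and the schedule is not the optimal one derived in the paper.

```python
import numpy as np


def sgd_linear_regression(X, y, lr, batch_schedule, n_steps, seed=0):
    """Minibatch SGD on least squares with a step-dependent batch size.

    `batch_schedule` is a hypothetical callable mapping the step index k
    to a (possibly non-integer) batch size; it is clipped to [1, n].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    samples_used = 0
    for k in range(n_steps):
        b = int(min(n, max(1, batch_schedule(k))))
        idx = rng.choice(n, size=b, replace=False)
        residual = X[idx] @ theta - y[idx]
        grad = X[idx].T @ residual / b  # gradient of the mean of 0.5 * residual**2
        theta -= lr * grad
        samples_used += b
    return theta, samples_used


# Synthetic data (assumed for illustration only).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(2000, 5))
theta_true = data_rng.normal(size=5)
y = X @ theta_true + 0.1 * data_rng.normal(size=2000)

N_STEPS = 500


def growing_schedule(k, b_start=8, b_end=128):
    """Illustrative schedule: batch size grows linearly from b_start to b_end."""
    return b_start + (b_end - b_start) * k / (N_STEPS - 1)


theta_hat, samples_used = sgd_linear_regression(
    X, y, lr=0.05, batch_schedule=growing_schedule, n_steps=N_STEPS
)
error = np.linalg.norm(theta_hat - theta_true)
print(f"parameter error: {error:.4f}, samples processed: {samples_used}")
```

Swapping `growing_schedule` for a constant function gives a fixed-batch baseline. Comparing the two at an equal number of processed samples is one simple way to see the trade-off that a batch size schedule controls: small batches give cheap but noisy updates, large batches give expensive but accurate ones.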

