To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
This work provides theoretical insights into gradient clipping for high-dimensional optimization, addressing a practical method with limited prior understanding, though it is incremental in scope.
The authors analyze gradient clipping for high-dimensional least squares under streaming SGD. They find that clipping cannot improve performance under Gaussian noise but can help in other noise settings when the clipping threshold is tuned, and they propose a heuristic for near-optimal threshold scheduling that requires tuning only one hyperparameter.
The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension, a model- and dataset-dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss, and we demonstrate that this equation predicts the path of clipped SGD on synthetic, CIFAR10, and Wikitext2 data. We show that with Gaussian noise, clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits when the clipping threshold is tuned. We propose a simple heuristic for near-optimal scheduling of the clipping threshold which requires the tuning of only one hyperparameter. We conclude with a discussion of the links between high-dimensional clipping and neural network training.
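For readers unfamiliar with the setting, the following is a minimal sketch of clipped streaming SGD on a least squares problem. It is not the authors' code. It assumes isotropic Gaussian data with covariance I/dim, Gaussian label noise, a constant learning rate, and a constant clipping threshold. The names `clip_gradient` and `clipped_streaming_sgd` and all default parameter values are hypothetical choices made for illustration.

```python
import math
import random


def clip_gradient(grad, threshold):
    """Rescale grad so that its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]


def clipped_streaming_sgd(dim=50, steps=2000, lr=0.5, threshold=1.0,
                          noise_std=0.1, seed=0):
    """Run clipped SGD, drawing one fresh sample per step (streaming).

    Returns the excess risk 0.5 * ||x - x_star||^2 / dim before the
    first step and after every step.
    """
    rng = random.Random(seed)
    # Assumption: a random ground-truth parameter vector.
    x_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x = [0.0] * dim
    scale = 1.0 / math.sqrt(dim)  # assumption: data covariance is I / dim

    def excess_risk():
        return 0.5 * sum((xi - si) ** 2 for xi, si in zip(x, x_star)) / dim

    risks = [excess_risk()]
    for _ in range(steps):
        # Fresh sample (a, b): each data point is used exactly once.
        a = [scale * rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = sum(ai * si for ai, si in zip(a, x_star))
        b += noise_std * rng.gauss(0.0, 1.0)  # assumption: Gaussian noise
        # Gradient of 0.5 * (<a, x> - b)^2 with respect to x.
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        grad = [residual * ai for ai in a]
        # Clip the stochastic gradient, then take the SGD step.
        step = clip_gradient(grad, threshold)
        x = [xi - lr * gi for xi, gi in zip(x, step)]
        risks.append(excess_risk())
    return risks
```

Passing `threshold=float("inf")` disables clipping, so calling the function twice with the same seed gives a clipped and an unclipped risk curve on identical data. Substituting a different noise sampler for the Gaussian draw is one way to probe the non-Gaussian settings the abstract refers to.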