An Isometric Stochastic Optimizer
This work addresses optimization in deep learning. Its contribution is incremental: it builds on Adam by isolating and extending a specific invariance property.
The author sets out to explain Adam's success and, from that explanation, derives a new optimizer, Iso, based on the principle of making each parameter's step size independent of the norms of the other parameters. A variant, IsoAdam, achieves a speedup over Adam when training a small Transformer.
The Adam optimizer is the standard choice in deep learning applications. I propose a simple explanation of Adam's success: it makes each parameter's step size independent of the norms of the other parameters. Based on this principle I derive Iso, a new optimizer which makes the norm of a parameter's update invariant to the application of any linear transformation to its inputs and outputs. I develop a variant of Iso called IsoAdam that allows optimal hyperparameters to be transferred from Adam, and demonstrate that IsoAdam obtains a speedup over Adam when training a small Transformer.
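The scale-independence claim about Adam can be illustrated with a small sketch. It is not the Iso or IsoAdam algorithm, only the standard Adam update applied to a scalar parameter. The assumption, made purely for illustration, is that a change in the other parameters' norms shows up as a constant multiplicative factor on this parameter's gradient (as it would for one layer in a chain of linear maps). The names `adam_updates`, `sgd_updates`, and `scale` are hypothetical, and `eps` is set to zero so the invariance is exact up to floating-point error.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Return the standard Adam updates for one scalar parameter.

    eps defaults to 0.0 here (an illustrative choice, not Adam's usual
    default) so that the scale invariance is not blurred by the epsilon term.
    """
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


def sgd_updates(grads, lr=1e-3):
    """Return plain SGD updates for comparison."""
    return [-lr * g for g in grads]


# Hypothetical gradient sequence for one parameter.
grads = [0.5, -0.2, 0.8, 0.1]

# Assumption: growing the other parameters' norms multiplies this
# parameter's gradient by a constant factor.
scale = 100.0
scaled_grads = [scale * g for g in grads]

adam_base = adam_updates(grads)
adam_scaled = adam_updates(scaled_grads)
sgd_base = sgd_updates(grads)
sgd_scaled = sgd_updates(scaled_grads)

for step, (a, b, c, d) in enumerate(
    zip(adam_base, adam_scaled, sgd_base, sgd_scaled), start=1
):
    print(f"step {step}: adam {a:+.6f} vs {b:+.6f} | sgd {c:+.6f} vs {d:+.6f}")
```

Under these assumptions the two Adam columns agree while the SGD updates grow with `scale`, which is the sense in which Adam's step size ignores the gradient's overall scale. With a nonzero `eps` the agreement is only approximate.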