LG · OC · Oct 4, 2021

Stochastic Anderson Mixing for Nonconvex Stochastic Optimization

arXiv:2110.01543v1 · 26 citations
Originality: Incremental advance
AI Analysis

This work addresses a theoretical gap in applying Anderson mixing to machine learning optimization. It offers a method with proven convergence for nonconvex problems, though it appears incremental because it adapts an existing technique.

The authors address the lack of convergence theory for Anderson mixing in machine learning by proposing Stochastic Anderson Mixing (SAM), which adds damped projection and adaptive regularization to classical Anderson mixing for nonconvex stochastic optimization. They establish convergence guarantees and report improved performance when training neural networks such as CNNs and RNNs on image classification and language modeling tasks.

Anderson mixing (AM) is an acceleration method for fixed-point iterations. Despite its success and wide usage in scientific computing, the convergence theory of AM remains unclear, and its applications to machine learning problems are not well explored. In this paper, by introducing damped projection and adaptive regularization to classical AM, we propose a Stochastic Anderson Mixing (SAM) scheme to solve nonconvex stochastic optimization problems. Under mild assumptions, we establish the convergence theory of SAM, including almost sure convergence to stationary points and the worst-case iteration complexity. Moreover, the complexity bound can be improved when an iterate is chosen at random as the output. To further accelerate convergence, we incorporate a variance reduction technique into the proposed SAM. We also propose a preconditioned mixing strategy for SAM, which can empirically achieve faster convergence or better generalization ability. Finally, we apply the SAM method to train various neural networks, including the vanilla CNN, ResNets, WideResNet, ResNeXt, DenseNet and RNN. Experimental results on image classification and language modeling demonstrate the advantages of our method.
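
To make the starting point concrete, here is a minimal sketch of classical (deterministic) Anderson mixing for a fixed-point problem x = g(x). It is not the paper's SAM: the damped projection, adaptive regularization, variance reduction and preconditioning described in the abstract are not reproduced. The function name `anderson_mixing`, its parameters (`m`, `beta`, `tol`, `max_iter`) and the small linear test problem are all assumptions made for illustration.

```python
import numpy as np


def anderson_mixing(g, x0, m=5, beta=1.0, tol=1e-10, max_iter=100):
    """Classical Anderson mixing for x = g(x) (illustrative sketch, not SAM).

    m    : number of past differences kept in the history (assumed name)
    beta : mixing parameter applied to the residual (assumed name)
    Returns the final iterate and the number of iterations performed.
    """
    x = np.asarray(x0, dtype=float)
    r = g(x) - x                      # residual of the fixed-point map
    dX, dR = [], []                   # histories of iterate / residual differences
    for k in range(max_iter):
        if np.linalg.norm(r) < tol:
            return x, k
        if dX:
            X = np.stack(dX, axis=1)  # shape (n, history length)
            R = np.stack(dR, axis=1)
            # Least-squares fit of the current residual by past residual differences.
            gamma = np.linalg.lstsq(R, r, rcond=None)[0]
            x_new = x + beta * r - (X + beta * R) @ gamma
        else:
            x_new = x + beta * r      # no history yet: plain damped iteration
        r_new = g(x_new) - x_new
        dX.append(x_new - x)
        dR.append(r_new - r)
        if len(dX) > m:               # keep only the m most recent differences
            dX.pop(0)
            dR.pop(0)
        x, r = x_new, r_new
    return x, max_iter


# Hypothetical test problem: a contractive linear map g(x) = A x + b.
A = np.array([[0.5, 0.2, 0.0],
              [0.1, 0.4, 0.1],
              [0.0, 0.3, 0.6]])
b = np.array([1.0, 2.0, 3.0])
x_star, iters = anderson_mixing(lambda x: A @ x + b, np.zeros(3))
```

In this sketch the least-squares step is solved with `np.linalg.lstsq`, which tolerates a rank-deficient history. In a stochastic setting the residuals would come from noisy gradients, which is where the stabilizing modifications named in the abstract become relevant.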

