OC LG Feb 26, 2021

Cyclic Coordinate Dual Averaging with Extrapolation

arXiv:2102.13244v47.011 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses a fundamental gap in optimization theory for practitioners in statistical learning, providing theoretical justification for widely used methods, though it is incremental in nature.

The authors study the convergence of cyclic block coordinate methods and introduce a new method for variational inequality problems with monotone operators. Its convergence bounds match the optimal bounds of full gradient methods, but are stated in terms of a gradient Lipschitz constant with respect to a Mahalanobis norm. For m blocks, this constant is at most √m times the traditional Euclidean Lipschitz constant and can be much smaller. For operators with finite-sum structure, they also introduce a variance-reduced variant with lower per-iteration cost and better rates in some regimes.

Cyclic block coordinate methods are a fundamental class of optimization methods widely used in practice and implemented as part of standard software packages for statistical learning. Nevertheless, their convergence is generally not well understood and so far their good practical performance has not been explained by existing convergence analyses. In this work, we introduce a new block coordinate method that applies to the general class of variational inequality (VI) problems with monotone operators. This class includes composite convex optimization problems and convex-concave min-max optimization problems as special cases and has not been addressed by the existing work. The resulting convergence bounds match the optimal convergence bounds of full gradient methods, but are provided in terms of a novel gradient Lipschitz condition w.r.t. a Mahalanobis norm. For $m$ coordinate blocks, the resulting gradient Lipschitz constant in our bounds is never larger than a factor $\sqrt{m}$ compared to the traditional Euclidean Lipschitz constant, while it is possible for it to be much smaller. Further, for the case when the operator in the VI has finite-sum structure, we propose a variance reduced variant of our method which further decreases the per-iteration cost and has better convergence rates in certain regimes. To obtain these results, we use a gradient extrapolation strategy that allows us to view a cyclic collection of block coordinate-wise gradients as one implicit gradient.
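For readers unfamiliar with the term, the following is a minimal sketch of a plain cyclic coordinate method, not the method proposed in the paper: it has no extrapolation, dual averaging, or variance reduction. It assumes a strongly convex quadratic objective 0.5*x^T A x - b^T x with blocks of size one. The function name `cyclic_coordinate_descent` and the example matrix are hypothetical and chosen only for illustration.

```python
def cyclic_coordinate_descent(A, b, num_cycles=100):
    """Minimize 0.5*x^T A x - b^T x (A symmetric positive definite)
    by sweeping through the coordinates in a fixed cyclic order."""
    n = len(b)
    x = [0.0] * n
    for _ in range(num_cycles):
        for i in range(n):  # one coordinate (block of size 1) at a time
            # Partial gradient w.r.t. x[i]; it already sees the updated
            # values x[0..i-1] from the current sweep.
            grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
            # For a quadratic, a step of 1/A[i][i] minimizes exactly
            # along coordinate i.
            x[i] -= grad_i / A[i][i]
    return x


# Illustrative data (assumed): a small, strictly diagonally dominant matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x = cyclic_coordinate_descent(A, b)
# At the minimizer the full gradient A x - b vanishes.
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
```

Each inner step uses only one coordinate of the gradient, evaluated at a point that changes within the sweep. The abstract's extrapolation strategy is what lets the paper treat such a cyclic collection of block-wise gradients as one implicit gradient; this sketch does not attempt that.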
