OCOct 29, 2020
A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max ProblemsJiawei Zhang, Peijun Xiao, Ruoyu Sun et al.
Nonconvex-concave min-max problem arises in many machine learning applications including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solve this problem is the gradient descent-ascent (GDA) algorithm which unfortunately can exhibit oscillation in case of nonconvexity. In this paper, we introduce a "smoothing" scheme which can be combined with GDA to stabilize the oscillation and ensure convergence to a stationary solution. We prove that the stabilized GDA algorithm can achieve an $O(1/ε^2)$ iteration complexity for minimizing the pointwise maximum of a finite collection of nonconvex functions. Moreover, the smoothed GDA algorithm achieves an $O(1/ε^4)$ iteration complexity for general nonconvex-concave problems. Extensions of this stabilized GDA algorithm to multi-block cases are presented. To the best of our knowledge, this is the first algorithm to achieve $O(1/ε^2)$ for a class of nonconvex-concave problem. We illustrate the practical efficiency of the stabilized GDA algorithm on robust training.
OCJun 19, 2020
DEED: A General Quantization Scheme for Communication Efficiency in BitsTian Ye, Peijun Xiao, Ruoyu Sun
In distributed optimization, a popular technique to reduce communication is quantization. In this paper, we provide a general analysis framework for inexact gradient descent that is applicable to quantization schemes. We also propose a quantization scheme Double Encoding and Error Diminishing (DEED). DEED can achieve small communication complexity in three settings: frequent-communication large-memory, frequent-communication small-memory, and infrequent-communication (e.g. federated learning). More specifically, in the frequent-communication large-memory setting, DEED can be easily combined with Nesterov's method, so that the total number of bits required is $\tilde{O}( \sqrtκ \log 1/ε)$, where $\tilde{O}$ hides numerical constant and $\log κ$ factors. In the frequent-communication small-memory setting, DEED combined with SGD only requires $\tilde{O}( κ\log 1/ε)$ number of bits in the interpolation regime. In the infrequent communication setting, DEED combined with Federated averaging requires a smaller total number of bits than Federated Averaging. All these algorithms converge at the same rate as their non-quantized versions, while using a smaller number of bits.
OCOct 10, 2019
Understanding Limitation of Two Symmetrized Orders by Worst-case ComplexityPeijun Xiao, Zhisheng Xiao, Ruoyu Sun
Update order is one of the major design choices of block decomposition algorithms. There are at least two classes of deterministic update orders: nonsymmetric (e.g. cyclic order) and symmetric (e.g. Gaussian back substitution or symmetric Gauss-Seidel). Recently, Coordinate Descent (CD) with cyclic order was shown to be $O(n^2)$ times slower than randomized versions in the worst-case. A natural question arises: can the symmetrized orders achieve faster convergence rates than the cyclic order, or even getting close to the randomized versions? In this paper, we give a negative answer to this question. We show that both Gaussian back substitution (GBS) and symmetric Gauss-Seidel (sGS) suffer from the same slow convergence issue as the cyclic order in the worst case. In particular, we prove that for unconstrained problems, both GBS-CD and sGS-CD can be $O(n^2)$ times slower than R-CD. Despite unconstrained problems, we also empirically study linearly constrained problems with quadratic objective: we empirically demonstrate that the convergence speed of GBS-ADMM and sGS-ADMM can be roughly $O(n^2)$ times slower than randomly permuted ADMM.