OC Aug 21, 2023
Decentralized Riemannian Conjugate Gradient Method on the Stiefel Manifold
Jun Chen, Haishan Ye, Mengmeng Wang et al.
The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than that of second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, they have received little study in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and the communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, the global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.
LG Apr 15, 2023
Stochastic Distributed Optimization under Average Second-order Similarity: Algorithms and Analysis
Dachao Lin, Yuze Han, Haishan Ye et al.
We study finite-sum distributed optimization problems involving a master node and $n-1$ local nodes under the popular $δ$-similarity and $μ$-strong convexity conditions. We propose two new algorithms, SVRS and AccSVRS, motivated by previous works. The non-accelerated SVRS method combines the techniques of gradient sliding and variance reduction and achieves a better communication complexity of $\tilde{\mathcal{O}}(n {+} \sqrt{n}δ/μ)$ compared to existing non-accelerated algorithms. Applying the framework proposed in Katyusha X, we also develop a directly accelerated version named AccSVRS with the $\tilde{\mathcal{O}}(n {+} n^{3/4}\sqrt{δ/μ})$ communication complexity. In contrast to existing results, our complexity bounds are entirely smoothness-free and exhibit superiority in ill-conditioned cases. Furthermore, we establish a nearly matched lower bound to verify the tightness of our AccSVRS method.
NA Mar 21, 2020
Approximate Newton Methods
Haishan Ye, Luo Luo, Zhihua Zhang
Many machine learning models involve solving optimization problems, so handling large-scale optimization problems is important in big data applications. Recently, subsampled Newton methods have attracted much attention because each iteration is cheap, which rectifies a weakness of the ordinary Newton method: a high cost per iteration despite a high convergence rate. Other efficient stochastic second-order methods have also been proposed. However, the convergence properties of these methods are still not well understood. There are also several important gaps between the current convergence theory and the performance in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework to analyze both local and global convergence properties of second-order methods. Based on this framework, we present theoretical results that match the performance in real applications well.
LG Aug 1, 2023
Mirror Natural Evolution Strategies
Haishan Ye
Zeroth-order optimization has been widely used in machine learning applications. However, the theoretical study of zeroth-order optimization focuses on algorithms that approximate (first-order) gradients using (zeroth-order) function value differences along a random direction. The theory of algorithms that approximate both gradient and Hessian information by zeroth-order queries is much less studied. In this paper, we focus on the theory of zeroth-order optimization that utilizes both the first-order and second-order information approximated by zeroth-order queries. We first propose a novel reparameterized objective function with parameters $(μ, Σ)$. This reparameterized objective function achieves its optimum at the minimizer and the Hessian inverse of the original objective function respectively, but with small perturbations. Accordingly, we propose a new algorithm to minimize our proposed reparameterized objective, which we call MiNES (mirror descent natural evolution strategy). We show that the estimated covariance matrix of MiNES converges to the inverse of the Hessian matrix of the objective function with a convergence rate $\widetilde{\mathcal{O}}(1/k)$, where $k$ is the iteration number and $\widetilde{\mathcal{O}}(\cdot)$ hides the constant and $\log$ terms. We also provide the explicit convergence rate of MiNES and show how the covariance matrix promotes the convergence rate.
LG Dec 5, 2022
An Efficient Stochastic Algorithm for Decentralized Nonconvex-Strongly-Concave Minimax Optimization
Lesi Chen, Haishan Ye, Luo Luo
This paper studies the stochastic nonconvex-strongly-concave minimax optimization over a multi-agent network. We propose an efficient algorithm, called Decentralized Recursive gradient descEnt Ascent Method (DREAM), which achieves the best-known theoretical guarantee for finding the $ε$-stationary points. Concretely, it requires $\mathcal{O}(\min (κ^3ε^{-3},κ^2 \sqrt{N} ε^{-2} ))$ stochastic first-order oracle (SFO) calls and $\tilde{\mathcal{O}}(κ^2 ε^{-2})$ communication rounds, where $κ$ is the condition number and $N$ is the total number of individual functions. Our numerical experiments also validate the superiority of DREAM over previous methods.
NA Nov 19, 2015
Accelerating Random Kaczmarz Algorithm Based on Clustering Information
Yujun Li, Kaichun Mo, Haishan Ye
The Kaczmarz algorithm is an efficient iterative algorithm for solving overdetermined consistent systems of linear equations. During each updating step, Kaczmarz chooses a hyperplane based on an individual equation and projects the current estimate of the exact solution onto that hyperplane to get a new estimate. Many variants of the Kaczmarz algorithm have been proposed that differ in how they choose better hyperplanes. Using the property of randomly sampled data in high-dimensional space, we propose an accelerated algorithm based on clustering information to improve block Kaczmarz and Kaczmarz via the Johnson-Lindenstrauss lemma. Additionally, we theoretically demonstrate a convergence improvement for the block Kaczmarz algorithm.
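The projection step described above can be illustrated with a minimal sketch of plain randomized Kaczmarz, in which rows are sampled with probability proportional to their squared norm. This is the textbook baseline, not the clustering-accelerated variant proposed in the paper; the function name, the test system, and the iteration count are assumptions made for illustration.

```python
import random


def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Approximate a solution of the consistent system A x = b.

    Baseline randomized Kaczmarz (illustrative only); assumes no row of A is zero.
    """
    rng = random.Random(seed)
    n_cols = len(A[0])
    # Squared row norms double as sampling weights and projection scalings.
    weights = [sum(a * a for a in row) for row in A]
    x = [0.0] * n_cols
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=weights)[0]
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        scale = residual / weights[i]
        # Orthogonal projection of x onto the hyperplane {z : <a_i, z> = b_i}.
        x = [xj + scale * a for xj, a in zip(x, row)]
    return x


# Hypothetical overdetermined, consistent 3x2 system with known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]
x_hat = randomized_kaczmarz(A, b)
```

Each update touches a single row, which is why the choice of hyperplane is the main lever that variants of the method try to improve.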
NA Feb 27, 2017
A Simple Approach to Optimal CUR Decomposition
Haishan Ye, Yujun Li, Zhihua Zhang
Prior optimal CUR decomposition and near optimal column reconstruction methods have been established by combining BSS sampling and adaptive sampling. In this paper, we propose a new approach to the optimal CUR decomposition and near optimal column reconstruction by just using leverage score sampling. In our approach, neither BSS sampling nor adaptive sampling is needed. Moreover, our approach is the first $O(\mathrm{nnz}(\mathbf{A}))$ optimal CUR algorithm, where $\mathbf{A}$ is the data matrix in question. We also extend our approach to the Nyström method, obtaining a fast algorithm which runs $\tilde{O}(n^{2})$ or $O(\mathrm{nnz}(\mathbf{A}))$
OC Oct 25, 2022
On the Complexity of Decentralized Smooth Nonconvex Finite-Sum Optimization
Luo Luo, Yunyan Bai, Lesi Chen et al.
We study the decentralized optimization problem $\min_{{\bf x}\in{\mathbb R}^d} f({\bf x})\triangleq \frac{1}{m}\sum_{i=1}^m f_i({\bf x})$, where the local function on the $i$-th agent has the form $f_i({\bf x})\triangleq \frac{1}{n}\sum_{j=1}^n f_{i,j}({\bf x})$ and every individual $f_{i,j}$ is smooth but possibly nonconvex. We propose a stochastic algorithm called the DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST) method, which achieves an $ε$-stationary point at each agent with $\tilde{\mathcal O}(Lε^{-2}/\sqrt{γ}\,)$ communication rounds, $\tilde{\mathcal O}(n+(L+\min\{nL, \sqrt{n/m}\bar L\})ε^{-2})$ computation rounds, and ${\mathcal O}(mn + {\min\{mnL, \sqrt{mn}\bar L\}}{ε^{-2}})$ local incremental first-order oracle calls, where $L$ is the smoothness parameter of the objective function, $\bar L$ is the mean-squared smoothness parameter of all individual functions, and $γ$ is the spectral gap of the mixing matrix associated with the network. We then establish lower bounds to show that the proposed method is near-optimal. Notice that the smoothness parameters $L$ and $\bar L$ used in our algorithm design and analysis are global, leading to sharper complexity bounds than existing results that depend on the local smoothness. We further extend DEAREST to solve the decentralized finite-sum optimization problem under the Polyak-Łojasiewicz condition, also achieving near-optimal complexity bounds.
AI Nov 11, 2025 [Code]
Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models
Zhishen Sun, Guang Dai, Ivor Tsang et al.
LLMs have made significant progress in the field of mathematical reasoning, but whether they have true mathematical understanding is still controversial. To explore this issue, we propose a new perturbation framework that evaluates LLMs' reasoning ability in complex environments by injecting additional semantically irrelevant perturbation sentences and gradually increasing the perturbation intensity. We also use an additional perturbation method, core questioning instruction missing, to further analyze the LLMs' problem-solving mechanism. The experimental results show that LLMs perform stably when facing perturbation sentences without numbers, but there is also a robustness boundary: as the perturbation intensity increases, the performance exhibits varying degrees of decline. When facing perturbation sentences with numbers, the performance decreases more significantly: most open-source models with fewer parameters drop by nearly or even more than 10%, and the drop grows further as the perturbation intensity increases, with the maximum decrease reaching 51.55%. Even the most advanced commercial LLMs see a 3%-10% performance drop. By analyzing the reasoning process of LLMs in detail, we find that models are more sensitive to perturbations with numerical information and are more likely to give incorrect answers when disturbed by irrelevant numerical information. The higher the perturbation intensity, the more obvious these defects are. At the same time, in the absence of the core questioning instruction, models can still maintain an accuracy of 20%-40%, indicating that LLMs may rely on memorized templates or pattern matching to complete the task, rather than logical reasoning. In general, our work reveals the shortcomings and limitations of current LLMs in their reasoning capabilities, which is of great significance for the further development of LLMs.
NA Nov 10, 2015
Fast Spectral Low Rank Matrix Approximation
Haishan Ye, Zhihua Zhang
First, we extend the results of approximate matrix multiplication from the Frobenius norm to the spectral norm. Second, we develop a class of fast approximate generalized linear regression algorithms with respect to the spectral norm. Finally, we give a fast approximate SVD.
LG Mar 26
Optimal High-Probability Regret for Online Convex Optimization with Two-Point Bandit Feedback
Haishan Ye
We consider the problem of Online Convex Optimization (OCO) with two-point bandit feedback in an adversarial environment. In this setting, a player attempts to minimize a sequence of adversarially generated convex loss functions, while only observing the value of each function at two points. While it is well known that two-point feedback allows for gradient estimation, achieving tight high-probability regret bounds for strongly convex functions has remained open, as highlighted by Agarwal et al. (2010). The primary challenge lies in the heavy-tailed nature of bandit gradient estimators, which makes standard concentration analysis difficult. In this paper, we resolve this open challenge by providing the first high-probability regret bound of $O(d(\log T + \log(1/δ))/μ)$ for $μ$-strongly convex losses. Our result is minimax optimal with respect to both the time horizon $T$ and the dimension $d$.
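To make the remark that two-point feedback allows for gradient estimation concrete, here is a minimal sketch of the classical two-point estimator with a direction drawn uniformly from the unit sphere. The abstract does not specify the paper's exact estimator or step sizes, so the function names, the linear test loss, and the sample count below are assumptions for illustration only.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_gradient(f, x, u, delta):
    """Estimate the gradient of f at x from the two values f(x + delta*u) and f(x - delta*u)."""
    d = len(x)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    coeff = d * (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [coeff * ui for ui in u]


# Hypothetical linear loss f(x) = <a, x>; its true gradient is a.
a = [1.0, -2.0, 3.0]


def linear_loss(x):
    return sum(ai * xi for ai, xi in zip(a, x))


rng = random.Random(0)
n_samples = 20000
avg = [0.0, 0.0, 0.0]
for _ in range(n_samples):
    u = random_unit_vector(3, rng)
    g = two_point_gradient(linear_loss, [0.0, 0.0, 0.0], u, delta=0.5)
    avg = [s + gi / n_samples for s, gi in zip(avg, g)]
```

A single estimate is noisy, since it only sees the gradient along one random direction; averaging many estimates recovers the gradient because the expectation of d·u·uᵀ over the unit sphere is the identity.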
LG Oct 22, 2023
PPFL: A Personalized Federated Learning Framework for Heterogeneous Population
Hao Di, Yi Yang, Haishan Ye et al.
Personalization aims to characterize individual preferences and is widely applied across many fields. However, conventional personalized methods operate in a centralized manner, potentially exposing raw data when pooling individual information. In this paper, with privacy considerations, we develop a flexible and interpretable personalized framework within the paradigm of federated learning, called PPFL (Population Personalized Federated Learning). By leveraging "canonical models" to capture fundamental characteristics of a heterogeneous population and employing "membership vectors" to reveal clients' preferences, PPFL models heterogeneity as clients' varying preferences for these characteristics. This approach provides substantial insights into client characteristics, which are lacking in existing Personalized Federated Learning (PFL) methods. Furthermore, we explore the relationship between PPFL and three main branches of PFL methods: clustered FL, multi-task PFL, and decoupling PFL, and demonstrate the advantages of PPFL. To solve PPFL (a non-convex optimization problem with linear constraints), we propose a novel random block coordinate descent algorithm and establish its convergence properties. We conduct experiments on both pathological and practical data sets, and the results validate the effectiveness of PPFL.
AI Nov 11, 2025
MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement
Zhishen Sun, Guang Dai, Haishan Ye
LLMs demonstrate performance comparable to human abilities in complex tasks such as mathematical reasoning, but their robustness in mathematical reasoning under minor input perturbations still lacks systematic investigation. Existing methods generally suffer from limited scalability, weak semantic preservation, and high costs. Therefore, we propose MSCR, an automated adversarial attack method based on multi-source candidate replacement. By combining three information sources including cosine similarity in the embedding space of LLMs, the WordNet dictionary, and contextual predictions from a masked language model, we generate for each word in the input question a set of semantically similar candidates, which are then filtered and substituted one by one to carry out the attack. We conduct large-scale experiments on LLMs using the GSM8K and MATH500 benchmarks. The results show that even a slight perturbation involving only a single word can significantly reduce the accuracy of all models, with the maximum drop reaching 49.89% on GSM8K and 35.40% on MATH500, while preserving the high semantic consistency of the perturbed questions. Further analysis reveals that perturbations not only lead to incorrect outputs but also substantially increase the average response length, which results in more redundant reasoning paths and higher computational resource consumption. These findings highlight the robustness deficiencies and efficiency bottlenecks of current LLMs in mathematical reasoning tasks.
LG Feb 23, 2024
Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer
Yanjun Zhao, Sizhe Dang, Haishan Ye et al.
Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer, which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizers for fine-tuning LLMs. Moreover, HiZOO avoids the expensive memory cost and adds only one forward pass per step. Extensive experiments on various models (350M to 66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. We also visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
OC Dec 22, 2025
Explicit and Non-asymptotic Query Complexities of Rank-Based Zeroth-order Algorithm on Stochastic Smooth Functions
Haishan Ye
Zeroth-order (ZO) optimization with ordinal feedback has emerged as a fundamental problem in modern machine learning systems, particularly in human-in-the-loop settings such as reinforcement learning from human feedback, preference learning, and evolutionary strategies. While rank-based ZO algorithms enjoy strong empirical success and robustness properties, their theoretical understanding, especially under stochastic objectives and standard smoothness assumptions, remains limited. In this paper, we study rank-based zeroth-order optimization for stochastic functions where only ordinal feedback of the stochastic function is available. We propose a simple and computationally efficient rank-based ZO algorithm. Under standard assumptions including smoothness, strong convexity, and bounded second moments of stochastic gradients, we establish explicit non-asymptotic query complexity bounds for both convex and nonconvex objectives. Notably, our results match the best-known query complexities of value-based ZO algorithms, demonstrating that ordinal information alone is sufficient for optimal query efficiency in stochastic settings. Our analysis departs from existing drift-based and information-geometric techniques, offering new tools for the study of rank-based optimization under noise. These findings narrow the gap between theory and practice and provide a principled foundation for optimization driven by human preferences.
LG Dec 18, 2025
Explicit and Non-asymptotic Query Complexities of Rank-Based Zeroth-order Algorithms on Smooth Functions
Haishan Ye
Rank-based zeroth-order (ZO) optimization -- which relies only on the ordering of function evaluations -- offers strong robustness to noise and monotone transformations, and underlies many successful algorithms such as CMA-ES, natural evolution strategies, and rank-based genetic algorithms. Despite its widespread use, the theoretical understanding of rank-based ZO methods remains limited: existing analyses provide only asymptotic insights and do not yield explicit convergence rates for algorithms selecting the top-$k$ directions. This work closes this gap by analyzing a simple rank-based ZO algorithm and establishing the first explicit and non-asymptotic query complexities. For a $d$-dimensional problem, if the function is $L$-smooth and $μ$-strongly convex, the algorithm achieves a query complexity of $\widetilde{\mathcal O}\!\left(\frac{dL}{μ}\log\!\frac{dL}{μδ}\log\!\frac{1}{\varepsilon}\right)$ to find an $\varepsilon$-suboptimal solution, and for smooth nonconvex objectives it reaches $\mathcal O\!\left(\frac{dL}{\varepsilon}\log\!\frac{1}{\varepsilon}\right)$. The notation $\mathcal O(\cdot)$ hides constant terms and $\widetilde{\mathcal O}(\cdot)$ hides an extra $\log\log\frac{1}{\varepsilon}$ term. These query complexities hold with probability at least $1-δ$ for $0<δ<1$. The analysis in this paper is novel and avoids classical drift and information-geometric techniques. Our analysis offers new insight into why rank-based heuristics lead to efficient ZO optimization.
LG Jun 10, 2025
FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed
Sizhe Dang, Yangyang Guo, Yanjun Zhao et al.
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually require many more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer toward Adam-Scale Speed. FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. It also accelerates per-batch computation through the use of Rademacher random vector perturbations coupled with CUDA's parallel processing. Extensive experiments on diverse models, including RoBERTa-large, OPT (350M-66B), Phi-2, and Llama3, across 11 tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by 3 percent in accuracy while requiring 3 times fewer forward passes. For RoBERTa-large, FZOO achieves average improvements of 5.6 percent in accuracy and an 18 times reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and its convergence guarantees. FZOO integrates smoothly into PEFT techniques, enabling even larger memory savings. Overall, our results make single-GPU, high-speed, full-parameter fine-tuning practical and point toward future work on memory-efficient pre-training.
LG Jan 13, 2025
An Enhanced Zeroth-Order Stochastic Frank-Wolfe Framework for Constrained Finite-Sum Optimization
Haishan Ye, Yinghui Huang, Hao Di et al.
We propose an enhanced zeroth-order stochastic Frank-Wolfe framework to address constrained finite-sum optimization problems, a structure prevalent in large-scale machine-learning applications. Our method introduces a novel double variance reduction framework that effectively reduces the gradient approximation variance induced by zeroth-order oracles and the stochastic sampling variance from finite-sum objectives. By leveraging this framework, our algorithm achieves significant improvements in query efficiency, making it particularly well-suited for high-dimensional optimization tasks. Specifically, for convex objectives, the algorithm achieves a query complexity of $O(d \sqrt{n}/ε)$ to find an $ε$-suboptimal solution, where $d$ is the dimensionality and $n$ is the number of functions in the finite-sum objective. For non-convex objectives, it achieves a query complexity of $O(d^{3/2}\sqrt{n}/ε^2)$ without requiring the computation of $d$ partial derivatives at each iteration. These complexities are the best known among zeroth-order stochastic Frank-Wolfe algorithms that avoid explicit gradient calculations. Empirical experiments on convex and non-convex machine learning tasks, including sparse logistic regression, robust classification, and adversarial attacks on deep networks, validate the computational efficiency and scalability of our approach. Our algorithm demonstrates superior performance in both convergence rate and query complexity compared to existing methods.
LG Feb 1
ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning
Zhishen Sun, Sizhe Dang, Guang Dai et al.
Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full-parameter fine-tuning framework that tightly combines the zeroth-order search in parameter space from Evolution Strategies (ES) with Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematical reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models and that its overall performance is comparable to RL methods. It surpasses the classic RL algorithm PPO (77.72% accuracy) and is comparable to GRPO (78.34% accuracy), even surpassing both on some models. In terms of GPU memory, ESSAM reduces the average usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO.
CL Oct 26, 2025
Frustratingly Easy Task-aware Pruning for Large Language Models
Yuanhe Tian, Junjie Liu, Xican Yang et al.
Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often ranks the importance of LLM parameters using their magnitudes and calibration-data activations and removes (or masks) the less important ones, accordingly reducing LLMs' size. However, these approaches primarily focus on preserving the LLM's ability to generate fluent sentences, while neglecting performance on specific domains and tasks. In this paper, we propose a simple yet effective pruning approach for LLMs that preserves task-specific capabilities while shrinking their parameter space. We first analyze how conventional pruning minimizes loss perturbation under general-domain calibration and extend this formulation by incorporating task-specific feature distributions into the importance computation of existing pruning algorithms. Thus, our framework computes separate importance scores using both general and task-specific calibration data, partitions parameters into shared and exclusive groups based on activation-norm differences, and then fuses their scores to guide the pruning process. This design enables our method to integrate seamlessly with various foundation pruning techniques and preserve the LLM's specialized abilities under compression. Experiments on widely used benchmarks demonstrate that our approach is effective and consistently outperforms the baselines with identical pruning ratios and different settings.
LG May 29, 2025
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization
Chengli Tan, Yubo Zhou, Haishan Ye et al.
Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, we show that, unlike standard training such as stochastic gradient descent, the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.
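The abstract refers to calibration error without naming a specific metric. One common choice is the expected calibration error (ECE) with equal-width confidence bins, sketched below under that assumption; the function name, the bin count, and the toy predictions are hypothetical and not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1.0 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Toy overconfident model: claims 90% confidence but is right 3 times out of 4.
ece_overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

An overconfident model shows up as bins whose mean confidence exceeds their accuracy, which is the tendency the abstract says SAM counteracts.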
OC Feb 1, 2022
Decentralized Stochastic Variance Reduced Extragradient Method
Luo Luo, Haishan Ye
This paper studies decentralized convex-concave minimax optimization problems of the form $\min_x\max_y f(x,y) \triangleq\frac{1}{m}\sum_{i=1}^m f_i(x,y)$, where $m$ is the number of agents and each local function can be written as $f_i(x,y)=\frac{1}{n}\sum_{j=1}^n f_{i,j}(x,y)$. We propose a novel decentralized optimization algorithm, called multi-consensus stochastic variance reduced extragradient, which achieves the best known stochastic first-order oracle (SFO) complexity for this problem. Specifically, each agent requires $\mathcal O((n+κ\sqrt{n})\log(1/\varepsilon))$ SFO calls for strongly-convex-strongly-concave problems and $\mathcal O((n+\sqrt{n}L/\varepsilon)\log(1/\varepsilon))$ SFO calls for general convex-concave problems to achieve an $\varepsilon$-accurate solution in expectation, where $κ$ is the condition number and $L$ is the smoothness parameter. The numerical experiments show that the proposed method performs better than the baselines.
LG Oct 27, 2021
Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums
Rui Pan, Haishan Ye, Tong Zhang
Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known what schedules of SGD achieve the best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. This condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay in such situations. For other situations, the proposed schedulers are superior to cosine decay.
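For readers unfamiliar with the two baseline schedules the abstract compares against, the sketch below writes out step decay and cosine decay in their usual forms. It does not implement Eigencurve, whose formula is not given in the abstract; the function names and default parameters are assumptions for illustration.

```python
import math


def cosine_decay(step, total_steps, lr_max, lr_min=0.0):
    """Cosine schedule: lr_max at step 0, decreasing smoothly to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def step_decay(step, lr_max, drop_every, factor=0.1):
    """Step schedule: multiply the learning rate by `factor` every `drop_every` steps."""
    return lr_max * factor ** (step // drop_every)


# Hypothetical 100-step run starting from a learning rate of 0.1.
cosine_schedule = [cosine_decay(t, 100, 0.1) for t in range(101)]
step_schedule = [step_decay(t, 0.1, drop_every=40) for t in range(101)]
```

Step decay is piecewise constant, while cosine decay shrinks continuously; the abstract's claim is that, for some Hessian spectra, the optimal schedule has a shape close to the latter.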
LG Feb 8, 2021
DeEPCA: Decentralized Exact PCA with Linear Convergence Rate
Haishan Ye, Tong Zhang
Due to the rapid growth of smart agents such as weakly connected computational nodes and sensors, developing decentralized algorithms that can perform computations on local agents has become a major research direction. This paper considers the problem of decentralized principal component analysis (PCA), a statistical method widely used for data analysis. We introduce a technique called subspace tracking to reduce the communication cost, and apply it to power iterations. This leads to a decentralized PCA algorithm called DeEPCA, which has a convergence rate similar to that of centralized PCA, while achieving the best communication complexity among existing decentralized PCA algorithms. DeEPCA is the first decentralized PCA algorithm in which the number of communication rounds for each power iteration is independent of the target precision. Compared to existing algorithms, the proposed method is easier to tune in practice, with an improved overall communication cost. Our experiments validate the advantages of DeEPCA empirically.
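The power iteration that the abstract builds on can be sketched in its ordinary centralized, single-vector form as follows. This is not DeEPCA: there is no subspace tracking and no communication between agents, and the function name and the 2x2 test matrix are assumptions for illustration.

```python
import math


def power_iteration(M, iters=200):
    """Return (eigenvalue, eigenvector) estimates for the dominant eigenpair.

    Assumes M is a symmetric positive semidefinite matrix (e.g. a covariance
    matrix) and that the uniform start vector is not orthogonal to the top eigenvector.
    """
    n = len(M)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        # Multiply by M, then renormalize to unit length.
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(vi * mvi for vi, mvi in zip(v, Mv))  # Rayleigh quotient
    return eigenvalue, v


# Hypothetical covariance matrix with eigenvalues (7 +/- sqrt(5)) / 2.
cov = [[4.0, 1.0], [1.0, 3.0]]
top_value, top_vector = power_iteration(cov)
```

In a decentralized setting each agent holds only part of the data behind the matrix, so every multiplication step requires communication, which is the cost the paper aims to reduce.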
OC Dec 30, 2020
PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction
Haishan Ye, Wei Xiong, Tong Zhang
This paper considers the decentralized composite optimization problem. We propose a novel decentralized variance-reduction proximal-gradient algorithmic framework, called PMGT-VR, which is based on a combination of several techniques including multi-consensus, gradient tracking, and variance reduction. The proposed framework relies on an imitation of centralized algorithms and we demonstrate that algorithms under this framework achieve convergence rates similar to that of their centralized counterparts. We also describe and analyze two representative algorithms, PMGT-SAGA and PMGT-LSVRG, and compare them to existing state-of-the-art proximal algorithms. To the best of our knowledge, PMGT-VR is the first linearly convergent decentralized stochastic algorithm that can solve decentralized composite optimization problems. Numerical experiments are provided to demonstrate the effectiveness of the proposed algorithms.
LG · Sep 5, 2020
Revisiting Co-Occurring Directions: Sharper Analysis and Efficient Algorithm for Sparse Matrices
Luo Luo, Cheng Chen, Guangzeng Xie et al.
We study the streaming model for approximate matrix multiplication (AMM). We are interested in the scenario in which the algorithm can take only one pass over the data with limited memory. The state-of-the-art deterministic sketching algorithm for streaming AMM is co-occurring directions (COD), which has much smaller approximation errors than randomized algorithms and outperforms other deterministic sketching methods empirically. In this paper, we provide a tighter error bound for COD whose leading term considers the potential approximate low-rank structure and the correlation of the input matrices. We prove that COD is space optimal with respect to our improved error bound. We also propose a variant of COD for sparse matrices with theoretical guarantees. Experiments on real-world sparse datasets show that the proposed algorithm is more efficient than baseline methods.
LG · May 2, 2020
Multi-consensus Decentralized Accelerated Gradient Descent
Haishan Ye, Luo Luo, Ziang Zhou et al.
This paper considers the decentralized convex optimization problem, which has a wide range of applications in large-scale machine learning, sensor networks, and control theory. We propose novel algorithms that achieve optimal computation complexity and near-optimal communication complexity. Our theoretical results give an affirmative answer to the open problem of whether there exists an algorithm whose communication complexity (nearly) matches the lower bound depending on the global condition number instead of the local one. Furthermore, the linear convergence of our algorithms depends only on the strong convexity of the global objective and does not require the local functions to be convex. The design of our methods relies on a novel integration of well-known techniques including Nesterov's acceleration, multi-consensus, and gradient tracking. Empirical studies show the superior performance of our methods in machine learning applications.
LG · Mar 27, 2020
MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation
Chaoyang He, Haishan Ye, Li Shen et al.
Many recently proposed methods for Neural Architecture Search (NAS) can be formulated as bilevel optimization. For efficient implementation, its solution requires approximations of second-order methods. In this paper, we demonstrate that gradient errors caused by such approximations lead to suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. To remedy this, this paper proposes MiLeNAS, a mixed-level reformulation for NAS that can be optimized efficiently and reliably. It is shown that even when using a simple first-order method on the mixed-level formulation, MiLeNAS can achieve a lower validation error for NAS problems. Consequently, architectures obtained by our method achieve consistently higher accuracies than those obtained from bilevel optimization. Moreover, MiLeNAS provides a framework that goes beyond DARTS. It is upgraded via model size-based search and early stopping strategies to complete the search process in around 5 hours. Extensive experiments within the convolutional architecture search space validate the effectiveness of our approach.
LG · Jan 11, 2020
Stochastic Recursive Gradient Descent Ascent for Stochastic Nonconvex-Strongly-Concave Minimax Problems
Luo Luo, Haishan Ye, Zhichao Huang et al.
We consider nonconvex-concave minimax optimization problems of the form $\min_{\bf x}\max_{\bf y\in{\mathcal Y}} f({\bf x},{\bf y})$, where $f$ is strongly concave in $\bf y$ but possibly nonconvex in $\bf x$ and ${\mathcal Y}$ is a convex and compact set. We focus on the stochastic setting, where we can only access an unbiased stochastic gradient estimate of $f$ at each iteration. This formulation includes many machine learning applications as special cases, such as robust optimization and adversarial training. We are interested in finding an ${\mathcal O}(\varepsilon)$-stationary point of the function $Φ(\cdot)=\max_{\bf y\in{\mathcal Y}} f(\cdot, {\bf y})$. The most popular algorithm to solve this problem is stochastic gradient descent ascent, which requires $\mathcal O(κ^3\varepsilon^{-4})$ stochastic gradient evaluations, where $κ$ is the condition number. In this paper, we propose a novel method called Stochastic Recursive gradiEnt Descent Ascent (SREDA), which estimates gradients more efficiently using variance reduction. This method achieves the best known stochastic gradient complexity of ${\mathcal O}(κ^3\varepsilon^{-3})$, and its dependency on $\varepsilon$ is optimal for this problem.
LG · Dec 27, 2019
Fast Generalized Matrix Regression with Applications in Machine Learning
Haishan Ye, Shusen Wang, Zhihua Zhang et al.
Fast matrix algorithms have become fundamental tools of machine learning in the big data era. The generalized matrix regression (GMR) problem is widely used in matrix approximation tasks such as CUR decomposition, kernel matrix approximation, and streaming singular value decomposition (SVD). In this paper, we propose a fast generalized matrix regression algorithm (Fast GMR), which uses sketching techniques to solve the GMR problem efficiently. Given an error parameter $0<ε<1$, the Fast GMR algorithm can achieve a $(1+ε)$ relative error with sketching sizes of order $\mathcal{O}(ε^{-1/2})$ for a large group of GMR problems. We apply the Fast GMR algorithm to symmetric positive definite matrix approximation and single-pass singular value decomposition, where it achieves better performance than conventional algorithms. Our empirical study also validates the effectiveness and efficiency of the proposed algorithms.
OC · Oct 25, 2019
Mirror Natural Evolution Strategies
Haishan Ye, Tong Zhang
Evolution Strategies such as CMA-ES (covariance matrix adaptation evolution strategy) and NES (natural evolution strategy) have been widely used in machine learning applications, where an objective function is optimized without using its derivatives. However, the convergence behaviors of these algorithms have not been carefully studied. In particular, there is no rigorous analysis of the convergence of the estimated covariance matrix, and it is unclear how the estimated covariance matrix helps the convergence of the algorithm. The relationship between Evolution Strategies and derivative-free optimization algorithms is also not clear. In this paper, we propose a new algorithm closely related to NES, which we call MiNES (mirror descent natural evolution strategy), for which we can establish rigorous convergence results. We show that the estimated covariance matrix of MiNES converges to the inverse of the Hessian matrix of the objective function at a sublinear convergence rate. Moreover, we show that some derivative-free optimization algorithms are special cases of MiNES. Our empirical studies demonstrate that MiNES is a query-efficient optimization algorithm competitive with classical algorithms including NES and CMA-ES.
LG · Dec 29, 2018
Hessian-Aware Zeroth-Order Optimization for Black-Box Adversarial Attack
Haishan Ye, Zhichao Huang, Cong Fang et al.
Zeroth-order optimization is an important research topic in machine learning. In recent years, it has become a key tool in black-box adversarial attacks on neural-network-based image classifiers. However, existing zeroth-order optimization algorithms rarely extract second-order information of the model function. In this paper, we utilize the second-order information of the objective function and propose a novel Hessian-aware zeroth-order algorithm called ZO-HessAware. Our theoretical result shows that ZO-HessAware has an improved zeroth-order convergence rate and query complexity under structured Hessian approximation, for which we propose a few methods of estimating the Hessian. Our empirical studies on the black-box adversarial attack problem validate that our algorithm can achieve improved success rates with a lower query complexity.
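To make the zeroth-order setting concrete, the following minimal sketch estimates a gradient from function values alone using central finite differences along each coordinate. It illustrates only the basic derivative-free query model and uses no Hessian information, so it is not ZO-HessAware. The function names, the smoothing parameter `mu`, and the test function are illustrative assumptions.

```python
def zo_gradient(f, x, mu=1e-5):
    """Estimate the gradient of f at x using only evaluations of f."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += mu    # probe f slightly ahead along coordinate i
        backward[i] -= mu   # and slightly behind
        grad.append((f(forward) - f(backward)) / (2 * mu))
    return grad

def quadratic(x):
    """Hypothetical black-box objective; its true gradient is (2*x0, 6*x1)."""
    return x[0] ** 2 + 3.0 * x[1] ** 2

estimate = zo_gradient(quadratic, [1.0, -2.0])
```

Each gradient estimate here costs two queries per coordinate, which is why query complexity is the natural cost measure in the black-box attack setting.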
LG · Oct 17, 2017
Nesterov's Acceleration For Approximate Newton
Haishan Ye, Zhihua Zhang
Optimization plays a key role in machine learning. Recently, stochastic second-order methods have attracted much attention due to their low computational cost in each iteration. However, these algorithms might perform poorly, especially if it is hard to approximate the Hessian well and efficiently. As far as we know, there is no effective way to handle this problem. In this paper, we resort to Nesterov's acceleration technique to improve the convergence performance of a class of second-order methods called approximate Newton. We show theoretically that Nesterov's acceleration technique can improve the convergence performance of approximate Newton just as it does for first-order methods. We accordingly propose an accelerated regularized sub-sampled Newton method. Our accelerated algorithm performs much better than the original regularized sub-sampled Newton method in experiments, which validates our theory empirically. In addition, the accelerated regularized sub-sampled Newton method performs comparably to or even better than classical algorithms.
LG · May 19, 2017
Nestrov's Acceleration For Second Order Method
Haishan Ye, Zhihua Zhang
Optimization plays a key role in machine learning. Recently, stochastic second-order methods have attracted much attention due to their low computational cost in each iteration. However, these algorithms might perform poorly, especially if it is hard to approximate the Hessian well and efficiently. As far as we know, there is no effective way to handle this problem. In this paper, we resort to Nesterov's acceleration technique to improve the convergence performance of a class of second-order methods called approximate Newton. We show theoretically that Nesterov's acceleration technique can improve the convergence performance of approximate Newton just as it does for first-order methods. We accordingly propose an accelerated regularized sub-sampled Newton method. Our accelerated algorithm performs much better than the original regularized sub-sampled Newton method in experiments, which validates our theory empirically. In addition, the accelerated regularized sub-sampled Newton method performs comparably to or even better than state-of-the-art algorithms.
NA · Sep 8, 2016
Tighter bound of Sketched Generalized Matrix Approximation
Haishan Ye, Qiaoming Ye, Zhihua Zhang
Generalized matrix approximation plays a fundamental role in many machine learning problems, such as CUR decomposition, kernel approximation, and low-rank matrix approximation. As today's applications involve ever larger datasets, more efficient generalized matrix approximation algorithms have become a crucially important research topic. In this paper, we find new sketching techniques that reduce the size of the original data matrix, and use them to develop new matrix approximation algorithms. Our results give a much tighter bound for the approximation than previous works: we obtain a $(1+ε)$ approximation ratio with small sketched dimensions, which implies a more efficient generalized matrix approximation.
OC · Sep 5, 2016
Revisiting Sub-sampled Newton Methods
Haishan Ye, Luo Luo, Zhihua Zhang
Many machine learning models depend on solving a large-scale optimization problem. Recently, sub-sampled Newton methods have attracted much attention because of their efficiency at each iteration: they rectify a weakness of the ordinary Newton method, which has a high convergence rate but suffers a high cost at each iteration. In this work we propose two new efficient Newton-type methods, Refined Sub-sampled Newton and Refined Sketch Newton. Our methods exhibit a great advantage over existing sub-sampled Newton methods, especially when Hessian-vector multiplication can be calculated efficiently. Specifically, the proposed methods are shown to converge superlinearly in the general case and quadratically under a slightly stronger assumption. The proposed methods can be generalized to a unifying framework for the convergence proof of several existing sub-sampled Newton methods, revealing new convergence properties. Finally, we empirically evaluate the performance of our methods on several standard datasets, and the results show consistent improvement in computational efficiency.
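The basic idea of a sub-sampled Newton step can be sketched on a one-dimensional least-squares problem: the gradient is computed on all data, while the curvature is estimated from a random subset. This is a plain sub-sampled Newton iteration written for illustration, not the Refined Sub-sampled Newton or Refined Sketch Newton methods proposed in the paper. The data, sample size, and function name are assumptions.

```python
import random

def subsampled_newton(a, b, sample_size, steps, seed=0):
    """Minimize (1/2n) * sum((a_i * w - b_i)^2) over a scalar w."""
    rng = random.Random(seed)
    n = len(a)
    w = 0.0
    for _ in range(steps):
        # Full gradient over all n data points.
        grad = sum(ai * (ai * w - bi) for ai, bi in zip(a, b)) / n
        # Curvature estimated from a random subset only.
        batch = rng.sample(range(n), sample_size)
        hess = sum(a[i] ** 2 for i in batch) / sample_size
        w -= grad / hess
    return w

# Hypothetical data: b_i is roughly 2 * a_i with a small alternating offset.
a = [1.0 + 0.03 * (i % 10) for i in range(100)]
b = [2.0 * ai + 0.01 * (-1) ** i for i, ai in enumerate(a)]

w_hat = subsampled_newton(a, b, sample_size=20, steps=50)
# Closed-form least-squares solution, for comparison.
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

Because the sampled curvature only approximates the true one, each step of this sketch contracts the error by a constant factor instead of solving the quadratic in one step, so the convergence is linear.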