Guanghui Lan

OC
h-index10
51papers
4,131citations
Novelty62%
AI Score59

51 Papers

OCMar 28, 2023
Numerical Methods for Convex Multistage Stochastic Optimization

Guanghui Lan, Alexander Shapiro

Optimization problems involving sequential decisions in a stochastic environment were studied in Stochastic Programming (SP), Stochastic Optimal Control (SOC) and Markov Decision Processes (MDP). In this paper we mainly concentrate on SP and SOC modelling approaches. In these frameworks there are natural situations when the considered problems are convex. Classical approach to sequential optimization is based on dynamic programming. It has the problem of the so-called ``Curse of Dimensionality", in that its computational complexity increases exponentially with increase of dimension of state variables. Recent progress in solving convex multistage stochastic problems is based on cutting planes approximations of the cost-to-go (value) functions of dynamic programming equations. Cutting planes type algorithms in dynamical settings is one of the main topics of this paper. We also discuss Stochastic Approximation type methods applied to multistage stochastic optimization problems. From the computational complexity point of view, these two types of methods seem to be complimentary to each other. Cutting plane type methods can handle multistage problems with a large number of stages, but a relatively smaller number of state (decision) variables. On the other hand, stochastic approximation type methods can only deal with a small number of stages, but a large number of decision variables.

LGSep 21, 2022
First-order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, Tuo Zhao

We consider the problem of solving robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses the standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitates the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/ε))$ iteration complexity for finding an $ε$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, when the first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.

LGMay 11, 2022
Stochastic first-order methods for average-reward Markov decision processes

Tianjiao Li, Feiyang Wu, Guanghui Lan

We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. Compared with intensive research efforts in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions, and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward stochastic policy mirror descent (SPMD) method for solving AMDPs with and without regularizers and provide convergence guarantees in terms of the long-term average reward. For policy evaluation, existing on-policy methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies due to the lack of exploration in the action space. To remedy these issues, we develop a variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and design an exploratory VRTD method that resolves the exploration issue and provides comparable convergence guarantees. By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models. It is worth noting that when linear function approximation is utilized, our algorithm only needs to update in the low-dimensional parameter space and thus can handle MDPs with large state and action spaces.

LGNov 30, 2022
Policy Optimization over General State and Action Spaces

Caleb Ju, Guanghui Lan

Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tableau setting, one can not enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish linear convergence rate to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks as well as their convergence analysis appear to be new in the literature. Preliminary numerical results demonstrate the robustness of the aforementioned methods and show they can be competitive with state-of-the-art RL algorithms.

OCJul 4, 2023
Accelerated stochastic approximation with state-dependent noise

Sasila Ilandarideva, Anatoli Juditsky, Guanghui Lan et al.

We consider a class of stochastic smooth convex optimization problems under rather general assumptions on the noise in the stochastic gradient observation. As opposed to the classical problem setting in which the variance of noise is assumed to be uniformly bounded, herein we assume that the variance of stochastic gradients is related to the "sub-optimality" of the approximate solutions delivered by the algorithm. Such problems naturally arise in a variety of applications, in particular, in the well-known generalized linear regression problem in statistics. However, to the best of our knowledge, none of the existing stochastic approximation algorithms for solving this class of problems attain optimality in terms of the dependence on accuracy, problem parameters, and mini-batch size. We discuss two non-Euclidean accelerated stochastic approximation routines--stochastic accelerated gradient descent (SAGD) and stochastic gradient extrapolation (SGE)--which carry a particular duality relationship. We show that both SAGD and SGE, under appropriate conditions, achieve the optimal convergence rate, attaining the optimal iteration and sample complexities simultaneously. However, corresponding assumptions for the SGE algorithm are more general; they allow, for instance, for efficient application of the SGE to statistical estimation problems under heavy tail noises and discontinuous score functions. We also discuss the application of the SGE to problems satisfying quadratic growth conditions, and show how it can be used to recover sparse solutions. Finally, we report on some simulation experiments to illustrate numerical performance of our proposed algorithms in high-dimensional settings.

LGMar 8, 2023
Policy Mirror Descent Inherently Explores Action Space

Yan Li, Guanghui Lan

Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, tailors to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity is obtained with a linear dependence on the size of the action space. SPMD with the second evaluation operator, namely truncated on-policy Monte Carlo (TOMC), attains an $\tilde{\mathcal{O}}(\mathcal{H}_{\mathcal{D}}/ε^2)$ sample complexity, where $\mathcal{H}_{\mathcal{D}}$ mildly depends on the effective horizon and the size of the action space with properly chosen Bregman divergence (e.g., Tsallis divergence). SPMD with TOMC also exhibits stronger convergence properties in that it controls the optimality gap with high probability rather than in expectation. In contrast to explicit exploration, these new policy gradient methods can prevent repeatedly committing to potentially high-risk actions when searching for optimal policies.

OCJul 29, 2023
First-order Policy Optimization for Robust Policy Evaluation

Yan Li, Guanghui Lan

We adopt a policy optimization viewpoint towards policy evaluation for robust Markov decision process with $\mathrm{s}$-rectangular ambiguity sets. The developed method, named first-order policy evaluation (FRPE), provides the first unified framework for robust policy evaluation in both deterministic (offline) and stochastic (online) settings, with either tabular representation or generic function approximation. In particular, we establish linear convergence in the deterministic setting, and $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity in the stochastic setting. FRPE also extends naturally to evaluating the robust state-action value function with $(\mathrm{s}, \mathrm{a})$-rectangular ambiguity sets. We discuss the application of the developed results for stochastic policy optimization of large-scale robust MDPs.

OCOct 11, 2022
Functional Constrained Optimization for Risk Aversion and Sparsity Control

Yi Cheng, Guanghui Lan, H. Edwin Romeijn

Risk and sparsity requirements often need to be enforced simultaneously in many applications, e.g., in portfolio optimization, assortment planning, and treatment planning. Properly balancing these potentially conflicting requirements entails the formulation of functional constrained optimization with either convex or nonconvex objectives. In this paper, we focus on projection-free methods that can generate a sparse trajectory for solving these challenging functional constrained optimization problems. Specifically, for the convex setting, we propose a Level Conditional Gradient (LCG) method, which leverages a level-set framework to update the approximation of the optimal value and an inner conditional gradient oracle (CGO) for solving mini-max subproblems. We show that the method achieves $\mathcal{O}\big(\frac{1}{ε^2}\log\frac{1}ε\big)$ iteration complexity for solving both smooth and nonsmooth cases without dependency on a possibly large size of optimal dual Lagrange multiplier. For the nonconvex setting, we introduce the Level Inexact Proximal Point (IPP-LCG) method and the Direct Nonconvex Conditional Gradient (DNCG) method. The first approach taps into the advantage of LCG by transforming the problem into a series of convex subproblems and exhibits an $\mathcal{O}\big(\frac{1}{ε^3}\log\frac{1}ε\big)$ iteration complexity for finding an ($ε,ε$)-KKT point. The DNCG is the first single-loop projection-free method, with iteration complexity bounded by $\mathcal{O}\big(1/ε^4\big)$ for computing a so-called $ε$-Wolfe point. We demonstrate the effectiveness of LCG, IPP-LCG and DNCG by devising formulations and conducting numerical experiments on two risk averse sparse optimization applications: a portfolio selection problem with and without cardinality requirement, and a radiation therapy planning problem in healthcare.

OCMar 10, 2022
Optimal Methods for Convex Risk Averse Distributed Optimization

Guanghui Lan, Zhe Zhang

This paper studies the communication complexity of convex risk-averse optimization over a network. The problem generalizes the well-studied risk-neutral finite-sum distributed optimization problem and its importance stems from the need to handle risk in an uncertain environment. For algorithms in the literature, there exists a gap in communication complexities for solving risk-averse and risk-neutral problems. We propose two distributed algorithms, namely the distributed risk averse optimization (DRAO) method and the distributed risk averse optimization with sliding (DRAO-S) method, to close the gap. Specifically, the DRAO method achieves the optimal communication complexity by assuming a certain saddle point subproblem can be easily solved in the server node. The DRAO-S method removes the strong assumption by introducing a novel saddle point sliding subroutine which only requires the projection over the ambiguity set $P$. We observe that the number of $P$-projections performed by DRAO-S is optimal. Moreover, we develop matching lower complexity bounds to show the communication complexities of both DRAO and DRAO-S to be improvable. Numerical experiments are conducted to demonstrate the encouraging empirical performance of the DRAO-S method.

OCOct 16, 2023
A simple uniformly optimal method without line search for convex optimization

Tianjiao Li, Guanghui Lan

Line search (or backtracking) procedures have been widely employed into first-order methods for solving convex optimization problems, especially those with unknown problem parameters (e.g., Lipschitz constant). In this paper, we show that line search is superfluous in attaining the optimal rate of convergence for solving a convex optimization problem whose parameters are not given a priori. In particular, we present a novel accelerated gradient descent type algorithm called auto-conditioned fast gradient method (AC-FGM) that can achieve an optimal $\mathcal{O}(1/k^2)$ rate of convergence for smooth convex optimization without requiring the estimate of a global Lipschitz constant or the employment of line search procedures. We then extend AC-FGM to solve convex optimization problems with Hölder continuous gradients and show that it automatically achieves the optimal rates of convergence uniformly for all problem classes with the desired accuracy of the solution as the only input. Finally, we report some encouraging numerical results that demonstrate the advantages of AC-FGM over the previously developed parameter-free methods for convex optimization.

35.2LGMar 10
Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces

Ji Gao, Caleb Ju, Guanghui Lan et al.

Policy Dual Averaging (PDA) offers a principled Policy Mirror Descent (PMD) framework that more naturally admits value function approximation than standard PMD, enabling the use of approximate advantage (or Q-) functions while retaining strong convergence guarantees. However, applying PDA in continuous state and action spaces remains computationally challenging, since action selection involves solving an optimization sub-problem at each decision step. In this paper, we propose \textit{actor-accelerated PDA}, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees. We provide a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions. We then evaluate its performance on several benchmarks in robotics, control, and operations research problems. Actor-accelerated PDA achieves superior performance compared to popular on-policy baselines such as Proximal Policy Optimization (PPO). Overall, our results bridge the gap between the theoretical advantages of PDA and its practical deployment in continuous-action problems with function approximation.

OCNov 13, 2025
Global Solutions to Non-Convex Functional Constrained Problems with Hidden Convexity

Ilyas Fatkhullin, Niao He, Guanghui Lan et al.

Constrained non-convex optimization is fundamentally challenging, as global solutions are generally intractable and constraint qualifications may not hold. However, in many applications, including safe policy optimization in control and reinforcement learning, such problems possess hidden convexity, meaning they can be reformulated as convex programs via a nonlinear invertible transformation. Typically such transformations are implicit or unknown, making the direct link with the convex program impossible. On the other hand, (sub-)gradients with respect to the original variables are often accessible or can be easily estimated, which motivates algorithms that operate directly in the original (non-convex) problem space using standard (sub-)gradient oracles. In this work, we develop the first algorithms to provably solve such non-convex problems to global minima. First, using a modified inexact proximal point method, we establish global last-iterate convergence guarantees with $\widetilde{\mathcal{O}}(\varepsilon^{-3})$ oracle complexity in non-smooth setting. For smooth problems, we propose a new bundle-level type method based on linearly constrained quadratic subproblems, improving the oracle complexity to $\widetilde{\mathcal{O}}(\varepsilon^{-1})$. Surprisingly, despite non-convexity, our methodology does not require any constraint qualifications, can handle hidden convex equality constraints, and achieves complexities matching those for solving unconstrained hidden convex optimization.

83.0OCApr 7
Stochastic Auto-conditioned Fast Gradient Methods with Optimal Rates

Yao Ji, Guanghui Lan

Achieving optimal rates for stochastic composite convex optimization without prior knowledge of problem parameters remains a central challenge. In the deterministic setting, the auto-conditioned fast gradient method has recently been proposed to attain optimal accelerated rates without line-search procedures or prior knowledge of the Lipschitz smoothness constant, providing a natural prototype for parameter-free acceleration. However, extending this approach to the stochastic setting has proven technically challenging and remains open. Existing parameter-free stochastic methods either fail to achieve accelerated rates or rely on restrictive assumptions, such as bounded domains, bounded gradients, prior knowledge of the iteration horizon, or strictly sub-Gaussian noise. To address these limitations, we propose a stochastic variant of the auto-conditioned fast gradient method, referred to as stochastic AC-FGM. The proposed method is fully adaptive to the Lipschitz constant, the iteration horizon, and the noise level, enabling both adaptive stepsize selection and adaptive mini-batch sizing without line-search procedures. Under standard bounded conditional variance assumptions, we show that stochastic AC-FGM achieves the optimal iteration complexity of $O(1/\sqrt{\varepsilon})$ and the optimal sample complexity of $O(1/\varepsilon^2)$.

LGSep 28, 2024
Strongly-polynomial time and validation analysis of policy gradient methods

Caleb Ju, Guanghui Lan

This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly-polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap for each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function can be easily estimated in the stochastic case, and when coupled with easily computable upper bounds on policy values, they provide a convenient way to validate the solutions generated by policy gradient methods. Therefore, our developments offer a principled and computable measure of optimality for RL, whereas current practice tends to rely on algorithm-to-algorithm or baselines comparisons with no certificate of optimality.

OCDec 18, 2024
Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes

Guanghui Lan, Tianjiao Li, Yangyang Xu

We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the "vanilla" PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an "auto-conditioned" projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.

OCAug 6, 2025
Can SGD Handle Heavy-Tailed Noise?

Ilyas Fatkhullin, Florian Hübler, Guanghui Lan

Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under Hölder smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline -- even in regimes where the variance is unbounded.

64.4OCApr 7
Value Mirror Descent for Reinforcement Learning

Zhichao Jia, Guanghui Lan

Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space S, action space A, discount factor $γ\in(0,1)$ and costs in $[0,1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of $\tilde{O}(|S||A|(1-γ)^{-3}ε^{-2})$. Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning following offline training. Under a strongly convex regularizer, SVMD achieves sample complexity of $\tilde{O}(|S||A|(1-γ)^{-5}ε^{-1})$, improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees, constitute new contributions to the RL and optimization literature.

LGFeb 16, 2022
Data-Driven Minimax Optimization with Expectation Constraints

Shuoguang Yang, Xudong Li, Guanghui Lan

Attention to data-driven optimization approaches, including the well-known stochastic gradient descent method, has grown significantly over recent decades, but data-driven constraints have rarely been studied, because of the computational challenges of projections onto the feasible set defined by these hard constraints. In this paper, we focus on the non-smooth convex-concave stochastic minimax regime and formulate the data-driven constraints as expectation constraints. The minimax expectation constrained problem subsumes a broad class of real-world applications, including two-player zero-sum game and data-driven robust optimization. We propose a class of efficient primal-dual algorithms to tackle the minimax expectation-constrained problem, and show that our algorithms converge at the optimal rate of $\mathcal{O}(\frac{1}{\sqrt{N}})$. We demonstrate the practical efficiency of our algorithms by conducting numerical experiments on large-scale real-world applications.

LGJan 24, 2022
Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Yan Li, Guanghui Lan, Tuo Zhao

We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of the these computational properties. As a by product, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / ε^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.

LGJan 15, 2022
Block Policy Mirror Descent

Guanghui Lan, Yan Li, Tuo Zhao

In this paper, we present a new policy gradient (PG) methods, namely the block policy mirror descent (BPMD) method for solving a class of regularized reinforcement learning (RL) problems with (strongly)-convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, we provide a unified analysis for several sampling schemes, and show that BPMD achieves fast linear convergence to the global optimality. In particular, uniform sampling leads to comparable worst-case total computational complexity as batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A}\right\rvert /ε)$ (resp. $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A} \right\rvert /ε^2)$) sample complexities are established for the strongly-convex (resp. non-strongly-convex) regularizers, where $ε$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.

MLDec 24, 2021
Accelerated and instance-optimal policy evaluation with linear function approximation

Tianjiao Li, Guanghui Lan, Ashwin Pananjady

We study the problem of policy evaluation with linear function approximation and present efficient and practical algorithms that come with strong optimality guarantees. We begin by proving lower bounds that establish baselines on both the deterministic error and stochastic error in this problem. In particular, we prove an oracle complexity lower bound on the deterministic error in an instance-dependent norm associated with the stationary distribution of the transition kernel, and use the local asymptotic minimax machinery to prove an instance-dependent lower bound on the stochastic error in the i.i.d. observation model. Existing algorithms fail to match at least one of these lower bounds: To illustrate, we analyze a variance-reduced variant of temporal difference learning, showing in particular that it fails to achieve the oracle complexity lower bound. To remedy this issue, we develop an accelerated, variance-reduced fast temporal difference algorithm (VRFTD) that simultaneously matches both lower bounds and attains a strong notion of instance-optimality. Finally, we extend the VRFTD algorithm to the setting with Markovian observations, and provide instance-dependent convergence results. Our theoretical guarantees of optimality are corroborated by numerical experiments.

OCOct 20, 2021
Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

Tianjiao Li, Ziwei Guan, Shaofeng Zou et al.

The problem of constrained Markov decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its utilities/costs. A new primal-dual approach is proposed with a novel integration of three ingredients: entropy regularized policy optimizer, dual variable regularizer, and Nesterov's accelerated gradient descent dual optimizer, all of which are critical to achieve a faster convergence. The finite-time error bound of the proposed approach is characterized. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge to the global optimum with a complexity of $\tilde{\mathcal O}(1/ε)$ in terms of the optimality gap and the constraint violation, which improves the complexity of the existing primal-dual approach by a factor of $\mathcal O(1/ε)$ \citep{ding2020natural,paternain2019constrained}. This is the first demonstration that nonconcave CMDP problems can attain the complexity lower bound of $\mathcal O(1/ε)$ for convex optimization subject to convex constraints. Our primal-dual approach and non-asymptotic analysis are agnostic to the RL optimizer used, and thus are more flexible for practical applications. More generally, our approach also serves as the first algorithm that provably accelerates constrained nonconvex optimization with zero duality gap by exploiting the geometries such as the gradient dominance condition, for which the existing acceleration methods for constrained convex optimization are not applicable.

LGOct 10, 2021
Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Yan Li, Dhruv Choudhary, Xiaohan Wei et al.

Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely accredited to their token-dependent learning rate. However, the underlying mechanism for the efficiency of token-dependent learning rate remains underexplored. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show token-dependent learning rate provably improves convergence for non-convex embedding learning problems.

LGJan 30, 2021
Policy Mirror Descent for Reinforcement Learning: Linear Convergence, New Sampling Complexity, and Generalized Problem Classes

Guanghui Lan

We present new policy mirror descent (PMD) methods for solving reinforcement learning (RL) problems with either strongly convex or general convex regularizers. By exploring the structural properties of these overall highly nonconvex problems we show that the PMD methods exhibit fast linear rate of convergence to the global optimality. We develop stochastic counterparts of these methods, and establish an ${\cal O}(1/ε)$ (resp., ${\cal O}(1/ε^2)$) sampling complexity for solving these RL problems with strongly (resp., general) convex regularizers using different sampling schemes, where $ε$ denote the target accuracy. We further show that the complexity for computing the gradients of these regularizers, if necessary, can be bounded by ${\cal O}\{(\log_γε) [(1-γ)L/μ]^{1/2}\log (1/ε)\}$ (resp., ${\cal O} \{(\log_γε) (L/ε)^{1/2}\}$)for problems with strongly (resp., general) convex regularizers. Here $γ$ denotes the discounting factor. To the best of our knowledge, these complexity bounds, along with our algorithmic developments, appear to be new in both optimization and RL literature. The introduction of these convex regularizers also greatly expands the flexibility and applicability of RL models.

OCNov 19, 2020
Optimal Algorithms for Convex Nested Stochastic Composite Optimization

Zhe Zhang, Guanghui Lan

Recently, convex nested stochastic composite optimization (NSCO) has received considerable attention for its applications in reinforcement learning and risk-averse optimization. The current NSCO algorithms have worse stochastic oracle complexities, by orders of magnitude, than those for simpler stochastic composite optimization problems (e.g., sum of smooth and nonsmooth functions) without the nested structure. Moreover, they require all outer-layer functions to be smooth, which is not satisfied by some important applications. These discrepancies prompt us to ask: ``does the nested composition make stochastic optimization more difficult in terms of the order of oracle complexity?" In this paper, we answer the question by developing order-optimal algorithms for the convex NSCO problem constructed from an arbitrary composition of smooth, structured non-smooth and general non-smooth layer functions. When all outer-layer functions are smooth, we propose a stochastic sequential dual (SSD) method to achieve an oracle complexity of $\mathcal{O}(1/ε^2)$ ($\mathcal{O}(1/ε)$) when the problem is non-strongly (strongly) convex. When there exists some structured non-smooth or general non-smooth outer-layer function, we propose a nonsmooth stochastic sequential dual (nSSD) method to achieve an oracle complexity of $\mathcal{O}(1/ε^2)$. We provide a lower complexity bound to show the latter $\mathcal{O}(1/ε^2)$ complexity to be unimprovable even under a strongly convex setting. All these complexity results seem to be new in the literature and they indicate that the convex NSCO problem has the same order of oracle complexity as those without the nested composition in all but the strongly convex and outer-non-smooth problem.

OCNov 15, 2020
Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning

Georgios Kotsalis, Guanghui Lan, Tianjiao Li

The focus of this paper is on stochastic variational inequalities (VI) under Markovian noise. A prominent application of our algorithmic developments is the stochastic policy evaluation problem in reinforcement learning. Prior investigations in the literature focused on temporal difference (TD) learning by employing nonsmooth finite time analysis motivated by stochastic subgradient descent leading to certain limitations. These encompass the requirement of analyzing a modified TD algorithm that involves projection to an a-priori defined Euclidean ball, achieving a non-optimal convergence rate and no clear way of deriving the beneficial effects of parallel implementation. Our approach remedies these shortcomings in the broader context of stochastic VIs and in particular when it comes to stochastic policy evaluation. We developed a variety of simple TD learning type algorithms motivated by its original version that maintain its simplicity, while offering distinct advantages from a non-asymptotic analysis point of view. We first provide an improved analysis of the standard TD algorithm that can benefit from parallel implementation. Then we present versions of a conditional TD algorithm (CTD), that involves periodic updates of the stochastic iterates, which reduce the bias and therefore exhibit improved iteration complexity. This brings us to the fast TD (FTD) algorithm which combines elements of CTD and the stochastic operator extrapolation method of the companion paper. For a novel index resetting policy FTD exhibits the best known convergence rate. We also devised a robust version of the algorithm that is particularly suitable for discounting factors close to 1.

LGNov 11, 2020
CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee

Tengyu Xu, Yingbin Liang, Guanghui Lan

In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and meanwhile avoids violation of certain constraints on a number of expected total costs. In general, such SRL problems have nonconvex objective functions subject to multiple nonconvex constraints, and hence are very challenging to solve, particularly to provide a globally optimal policy. Many popular SRL algorithms adopt a primal-dual structure which utilizes the updating of dual variables for satisfying the constraints. In contrast, we propose a primal approach, called constraint-rectified policy optimization (CRPO), which updates the policy alternatingly between objective improvement and constraint satisfaction. CRPO provides a primal-type algorithmic framework to solve SRL problems, where each policy update can take any variant of policy optimization step. To demonstrate the theoretical performance of CRPO, we adopt natural policy gradient (NPG) for each policy update step and show that CRPO achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate to the global optimal policy in the constrained policy set and an $\mathcal{O}(1/\sqrt{T})$ error bound on constraint satisfaction. This is the first finite-time analysis of primal SRL algorithms with global optimality guarantee. Our empirical results demonstrate that CRPO can outperform the existing primal-dual baseline algorithms significantly.

OCNov 5, 2020
Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation

Georgios Kotsalis, Guanghui Lan, Tianjiao Li

In this paper we first present a novel operator extrapolation (OE) method for solving deterministic variational inequality (VI) problems. Similar to the gradient (operator) projection method, OE updates one single search sequence by solving a single projection subproblem in each iteration. We show that OE can achieve the optimal rate of convergence for solving a variety of VI problems in a much simpler way than existing approaches. We then introduce the stochastic operator extrapolation (SOE) method and establish its optimal convergence behavior for solving different stochastic VI problems. In particular, SOE achieves the optimal complexity for solving a fundamental problem, i.e., stochastic smooth and strongly monotone VI, for the first time in the literature. We also present a stochastic block operator extrapolations (SBOE) method to further reduce the iteration cost for the OE method applied to large-scale deterministic VIs with a certain block structure. Numerical experiments have been conducted to demonstrate the potential advantages of the proposed algorithms. In fact, all these algorithms are applied to solve generalized monotone variational inequality (GMVI) problems whose operator is not necessarily monotone. We will also discuss optimal OE-based policy evaluation methods for reinforcement learning in a companion paper.

OCOct 23, 2020
A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained Optimization

Digvijay Boob, Qi Deng, Guanghui Lan et al.

Nonconvex sparse models have received significant attention in high-dimensional machine learning. In this paper, we study a new model consisting of a general convex or nonconvex objectives and a variety of continuous nonconvex sparsity-inducing constraints. For this constrained model, we propose a novel proximal point algorithm that solves a sequence of convex subproblems with gradually relaxed constraint levels. Each subproblem, having a proximal point objective and a convex surrogate constraint, can be efficiently solved based on a fast routine for projection onto the surrogate constraint. We establish the asymptotic convergence of the proposed algorithm to the Karush-Kuhn-Tucker (KKT) solutions. We also establish new convergence complexities to achieve an approximate KKT solution when the objective can be smooth/nonsmooth, deterministic/stochastic and convex/nonconvex with complexity that is on a par with gradient descent for unconstrained optimization problems in respective cases. To the best of our knowledge, this is the first study of the first-order methods with complexity guarantee for nonconvex sparse-constrained problems. We perform numerical experiments to demonstrate the effectiveness of our new model and efficiency of the proposed algorithm for large scale problems.

OCJun 30, 2020
Conditional Gradient Methods for Convex Optimization with General Affine and Nonlinear Constraints

Guanghui Lan, Edwin Romeijn, Zhiqiang Zhou

Conditional gradient methods have attracted much attention in both machine learning and optimization communities recently. These simple methods can guarantee the generation of sparse solutions. In addition, without the computation of full gradients, they can handle huge-scale problems sometimes even with an exponentially increasing number of decision variables. This paper aims to significantly expand the application areas of these methods by presenting new conditional gradient methods for solving convex optimization problems with general affine and nonlinear constraints. More specifically, we first present a new constraint extrapolated condition gradient (CoexCG) method that can achieve an ${\cal O}(1/ε^2)$ iteration complexity for both smooth and structured nonsmooth function constrained convex optimization. We further develop novel variants of CoexCG, namely constraint extrapolated and dual regularized conditional gradient (CoexDurCG) methods, that can achieve similar iteration complexity to CoexCG but allow adaptive selection for algorithmic parameters. We illustrate the effectiveness of these methods for solving an important class of radiation therapy treatment planning problems arising from healthcare industry. To the best of our knowledge, all the algorithmic schemes and their complexity results are new in the area of projection-free methods.

OCJun 3, 2020
A Unified Single-loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems

Zi Xu, Huiling Zhang, Yang Xu et al.

Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees due to the relevance of these problems to a few emergent applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving smooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternatively at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) iterations under nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp., $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the strongly convex-nonconcave (resp., convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm is developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, the complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature. Numerical results show the efficiency of the proposed AGP algorithm. Furthermore, we extend the AGP algorithm by presenting a block alternating proximal gradient (BAPG) algorithm for solving more general multi-block nonsmooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. We can similarly establish the gradient complexity of the proposed algorithm under these four different settings.

OCDec 16, 2019
Complexity of Stochastic Dual Dynamic Programming

Guanghui Lan

Stochastic dual dynamic programming is a cutting plane type algorithm for multi-stage stochastic optimization originated about 30 years ago. In spite of its popularity in practice, there does not exist any analysis on the convergence rates of this method. In this paper, we first establish the number of iterations, i.e., iteration complexity, required by a basic dynamic cutting plane method for solving relatively simple multi-stage optimization problems, by introducing novel mathematical tools including the saturation of search points. We then refine these basic tools and establish the iteration complexity for both deterministic and stochastic dual dynamic programming methods for solving more general multi-stage stochastic optimization problems under the standard stage-wise independence assumption. Our results indicate that the complexity of some deterministic variants of these methods mildly increases with the number of stages $T$, in fact linearly dependent on $T$ for discounted problems. Therefore, they are efficient for strategic decision making which involves a large number of stages, but with a relatively small number of decision variables in each stage. Without explicitly discretizing the state and action spaces, these methods might also be pertinent to the related reinforcement learning and stochastic control areas.

OCAug 7, 2019
Stochastic First-order Methods for Convex and Nonconvex Functional Constrained Optimization

Digvijay Boob, Qi Deng, Guanghui Lan

Functional constrained optimization is becoming more and more important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semisupervised learning, and robust optimization among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with a stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Contrary to existing primal-dual methods, it does not require the projection of Lagrangian multipliers into a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method that transforms the initial nonconvex problem into a sequence of convex problems by adding quadratic terms to both the objective and constraints. Under a certain MFCQ-type assumption, we establish the convergence and rate of convergence of this method to KKT points when the convex subproblems are solved exactly or inexactly. For large-scale and stochastic problems, we present a more practical proximal point method in which the approximate solutions of the subproblems are computed by the aforementioned ConEx method. To the best of our knowledge, most of these convergence and complexity results of the proximal point method for nonconvex problems also seem to be new in the literature.

LGJun 1, 2019
GLAD: Learning Sparse Graph Recovery

Harsh Shrivastava, Xinshi Chen, Binghong Chen et al.

Recovering sparse conditional independence graphs from data is a fundamental problem in machine learning with wide applications. A popular formulation of the problem is an $\ell_1$ regularized maximum likelihood estimation. Many convex optimization algorithms have been designed to solve this formulation to recover the graph structure. Recently, there is a surge of interest to learn algorithms directly based on data, and in this case, learn to map empirical covariance to the sparse precision matrix. However, it is a challenging task in this case, since the symmetric positive definiteness (SPD) and sparsity of the matrix are not easy to enforce in learned algorithms, and a direct mapping from data to precision matrix may contain many parameters. We propose a deep learning architecture, GLAD, which uses an Alternating Minimization (AM) algorithm as our model inductive bias, and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data.

OCMay 29, 2019
A unified variance-reduced accelerated gradient method for convex optimization

Guanghui Lan, Zhize Li, Yi Zhou

We propose a novel randomized incremental gradient algorithm, namely, VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization. Equipped with a unified step-size policy that adjusts itself to the value of the condition number, Varag exhibits the unified optimal rates of convergence for solving smooth convex finite-sum problems directly regardless of their strong convexity. Moreover, Varag is the first accelerated randomized incremental gradient method that benefits from the strong convexity of the data-fidelity term to achieve the optimal linear convergence. It also establishes an optimal linear rate of convergence for solving a wide class of problems only satisfying a certain error bound condition rather than strong convexity. Varag can also be extended to solve stochastic finite-sum problems.

OCOct 9, 2018
Cubic Regularization with Momentum for Nonconvex Optimization

Zhe Wang, Yi Zhou, Yingbin Liang et al.

Momentum is a popular technique to accelerate the convergence in practical training, and its impact on convergence guarantee has been well-studied for first-order algorithms. However, such a successful acceleration technique has not yet been proposed for second-order algorithms in nonconvex optimization.In this paper, we apply the momentum scheme to cubic regularized (CR) Newton's method and explore the potential for acceleration. Our numerical experiments on various nonconvex optimization problems demonstrate that the momentum scheme can substantially facilitate the convergence of cubic regularization, and perform even better than the Nesterov's acceleration scheme for CR. Theoretically, we prove that CR under momentum achieves the best possible convergence rate to a second-order stationary point for nonconvex optimization. Moreover, we study the proposed algorithm for solving problems satisfying an error bound condition and establish a local quadratic convergence rate. Then, particularly for finite-sum problems, we show that the proposed algorithm can allow computational inexactness that reduces the overall sample complexity without degrading the convergence rate.

MLOct 1, 2018
Optimal Adaptive and Accelerated Stochastic Gradient Descent

Qi Deng, Yi Cheng, Guanghui Lan

Stochastic gradient descent (\textsc{Sgd}) methods are the most powerful optimization tools in training machine learning and deep learning models. Moreover, acceleration (a.k.a. momentum) methods and diagonal scaling (a.k.a. adaptive gradient) methods are the two main techniques to improve the slow convergence of \textsc{Sgd}. While empirical studies have demonstrated potential advantages of combining these two techniques, it remains unknown whether these methods can achieve the optimal rate of convergence for stochastic optimization. In this paper, we present a new class of adaptive and accelerated stochastic gradient descent methods and show that they exhibit the optimal sampling and iteration complexity for stochastic optimization. More specifically, we show that diagonal scaling, initially designed to improve vanilla stochastic gradient, can be incorporated into accelerated stochastic gradient descent to achieve the optimal rate of convergence for smooth stochastic optimization. We also show that momentum, apart from being known to speed up the convergence rate of deterministic optimization, also provides us new ways of designing non-uniform and aggressive moving average schemes in stochastic optimization. Finally, we present some heuristics that help to implement adaptive accelerated stochastic gradient descent methods and to further improve their practical performance for machine learning and deep learning.

CCSep 27, 2018
Complexity of Training ReLU Neural Network

Digvijay Boob, Santanu S. Dey, Guanghui Lan

In this paper, we explore some basic questions on the complexity of training neural networks with ReLU activation function. We show that it is NP-hard to train a two-hidden layer feedforward ReLU neural network. If dimension of the input data and the network topology is fixed, then we show that there exists a polynomial time algorithm for the same training problem. We also show that if sufficient over-parameterization is provided in the first hidden layer of ReLU neural network, then there is a polynomial time algorithm which finds weights such that output of the over-parameterized ReLU neural network matches with the output of the given data.

OCSep 24, 2018
Asynchronous decentralized accelerated stochastic gradient descent

Guanghui Lan, Yi Zhou

In this work, we introduce an asynchronous decentralized accelerated stochastic gradient descent type of method for decentralized stochastic optimization, considering communication and synchronization are the major bottlenecks. We establish $\mathcal{O}(1/ε)$ (resp., $\mathcal{O}(1/\sqrtε)$) communication complexity and $\mathcal{O}(1/ε^2)$ (resp., $\mathcal{O}(1/ε)$) sampling complexity for solving general convex (resp., strongly convex) problems.

OCAug 22, 2018
A Note on Inexact Condition for Cubic Regularized Newton's Method

Zhe Wang, Yi Zhou, Yingbin Liang et al.

This note considers the inexact cubic-regularized Newton's method (CR), which has been shown in \cite{Cartis2011a} to achieve the same order-level convergence rate to a secondary stationary point as the exact CR \citep{Nesterov2006}. However, the inexactness condition in \cite{Cartis2011a} is not implementable due to its dependence on future iterates variable. This note fixes such an issue by proving the same convergence rate for nonconvex optimization under an inexact adaptive condition that depends on only the current iterate. Our proof controls the sufficient decrease of the function value over the total iterations rather than each iteration as used in the previous studies, which can be of independent interest in other contexts.

OCFeb 20, 2018
Stochastic Variance-Reduced Cubic Regularization for Nonconvex Optimization

Zhe Wang, Yi Zhou, Yingbin Liang et al.

Cubic regularization (CR) is an optimization method with emerging popularity due to its capability to escape saddle points and converge to second-order stationary solutions for nonconvex optimization. However, CR encounters a high sample complexity issue for finite-sum problems with a large data size. %Various inexact variants of CR have been proposed to improve the sample complexity. In this paper, we propose a stochastic variance-reduced cubic-regularization (SVRC) method under random sampling, and study its convergence guarantee as well as sample complexity. We show that the iteration complexity of SVRC for achieving a second-order stationary solution within $ε$ accuracy is $O(ε^{-3/2})$, which matches the state-of-art result on CR types of methods. Moreover, our proposed variance reduction scheme significantly reduces the per-iteration sample complexity. The resulting total Hessian sample complexity of our SVRC is ${\Oc}(N^{2/3} ε^{-3/2})$, which outperforms the state-of-art result by a factor of $O(N^{2/15})$. We also study our SVRC under random sampling without replacement scheme, which yields a lower per-iteration sample complexity, and hence justifies its practical applicability.

OCNov 15, 2017
Random gradient extrapolation for distributed and stochastic optimization

Guanghui Lan, Yi Zhou

In this paper, we consider a class of finite-sum convex optimization problems defined over a distributed multiagent network with $m$ agents connected to a central server. In particular, the objective function consists of the average of $m$ ($\ge 1$) smooth components associated with each network agent together with a strongly convex term. Our major contribution is to develop a new randomized incremental gradient algorithm, namely random gradient extrapolation method (RGEM), which does not require any exact gradient evaluation even for the initial point, but can achieve the optimal ${\cal O}(\log(1/ε))$ complexity bound in terms of the total number of gradient evaluations of component functions to solve the finite-sum problems. Furthermore, we demonstrate that for stochastic finite-sum optimization problems, RGEM maintains the optimal ${\cal O}(1/ε)$ complexity (up to a certain logarithmic factor) in terms of the number of stochastic gradient computations, but attains an ${\cal O}(\log(1/ε))$ complexity in terms of communication rounds (each round involves only one agent). It is worth noting that the former bound is independent of the number of agents $m$, while the latter one only linearly depends on $m$ or even $\sqrt m$ for ill-conditioned problems. To the best of our knowledge, this is the first time that these complexity bounds have been obtained for distributed and stochastic optimization problems. Moreover, our algorithms were developed based on a novel dual perspective of Nesterov's accelerated gradient method.

LGOct 30, 2017
Theoretical properties of the global optimizer of two layer neural network

Digvijay Boob, Guanghui Lan

In this paper, we study the problem of optimizing a two-layer artificial neural network that best fits a training dataset. We look at this problem in the setting where the number of parameters is greater than the number of sampled points. We show that for a wide class of differentiable activation functions (this class involves "almost" all functions which are not piecewise linear), we have that first-order optimal solutions satisfy global optimality provided the hidden layer is non-singular. Our results are easily extended to hidden layers given by a flat matrix from that of a square matrix. Results are applicable even if network has more than one hidden layer provided all hidden layers satisfy non-singularity, all activations are from the given "good" class of differentiable functions and optimization is only with respect to the last hidden layer. We also study the smoothness properties of the objective function and show that it is actually Lipschitz smooth, i.e., its gradients do not change sharply. We use smoothness properties to guarantee asymptotic convergence of O(1/number of iterations) to a first-order optimal solution. We also show that our algorithm will maintain non-singularity of hidden layer for any finite number of iterations.

OCJul 11, 2017
Dynamic Stochastic Approximation for Multi-stage Stochastic Optimization

Guanghui Lan, Zhiqiang Zhou

In this paper, we consider multi-stage stochastic optimization problems with convex objectives and conic constraints at each stage. We present a new stochastic first-order method, namely the dynamic stochastic approximation (DSA) algorithm, for solving these types of stochastic optimization problems. We show that DSA can achieve an optimal ${\cal O}(1/ε^4)$ rate of convergence in terms of the total number of required scenarios when applied to a three-stage stochastic optimization problem. We further show that this rate of convergence can be improved to ${\cal O}(1/ε^2)$ when the objective function is strongly convex. We also discuss variants of DSA for solving more general multi-stage stochastic optimization problems with the number of stages $T > 3$. The developed DSA algorithms only need to go through the scenario tree once in order to compute an $ε$-solution of the multi-stage stochastic optimization problem. As a result, the memory required by DSA only grows linearly with respect to the number of stages. To the best of our knowledge, this is the first time that stochastic approximation type methods are generalized for multi-stage stochastic optimization with $T \ge 3$.

LGMar 16, 2017
Conditional Accelerated Lazy Stochastic Gradient Descent

Guanghui Lan, Sebastian Pokutta, Yi Zhou et al.

In this work we introduce a conditional accelerated lazy stochastic gradient descent algorithm with optimal number of calls to a stochastic first-order oracle and convergence rate $O\left(\frac{1}{\varepsilon^2}\right)$ improving over the projection-free, Online Frank-Wolfe based stochastic gradient descent of Hazan and Kale [2012] with convergence rate $O\left(\frac{1}{\varepsilon^4}\right)$.

OCJan 14, 2017
Communication-Efficient Algorithms for Decentralized and Stochastic Optimization

Guanghui Lan, Soomin Lee, Yi Zhou

We present a new class of decentralized first-order methods for nonsmooth and stochastic optimization problems defined over multiagent networks. Considering that communication is a major bottleneck in decentralized optimization, our main goal in this paper is to develop algorithmic frameworks which can significantly reduce the number of inter-node communications. We first propose a decentralized primal-dual method which can find an $ε$-solution both in terms of functional optimality gap and feasibility residual in $O(1/ε)$ inter-node communication rounds when the objective functions are convex and the local primal subproblems are solved exactly. Our major contribution is to present a new class of decentralized primal-dual type algorithms, namely the decentralized communication sliding (DCS) methods, which can skip the inter-node communications while agents solve the primal subproblems iteratively through linearizations of their local objective functions. By employing DCS, agents can still find an $ε$-solution in $O(1/ε)$ (resp., $O(1/\sqrtε)$) communication rounds for general convex functions (resp., strongly convex functions), while maintaining the $O(1/ε^2)$ (resp., $O(1/ε)$) bound on the total number of intra-node subgradient evaluations. We also present a stochastic counterpart for these algorithms, denoted by SDCS, for solving stochastic optimization problems whose objective function cannot be evaluated exactly. In comparison with existing results for decentralized nonsmooth and stochastic optimization, we can reduce the total number of inter-node communication rounds by orders of magnitude while still maintaining the optimal complexity bounds on intra-node stochastic subgradient evaluations. The bounds on the subgradient evaluations are actually comparable to those required for centralized nonsmooth and stochastic optimization.

OCApr 13, 2016
Algorithms for stochastic optimization with functional or expectation constraints

Guanghui Lan, Zhiqiang Zhou

This paper considers the problem of minimizing an expectation function over a closed convex set, coupled with a {\color{black} functional or expectation} constraint on either decision variables or problem parameters. We first present a new stochastic approximation (SA) type algorithm, namely the cooperative SA (CSA), to handle problems with the constraint on devision variables. We show that this algorithm exhibits the optimal ${\cal O}(1/ε^2)$ rate of convergence, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex, where $ε$ denotes the optimality gap and infeasibility. Moreover, we show that this rate of convergence can be improved to ${\cal O}(1/ε)$ if the objective and constraint functions are strongly convex. We then present a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) algorithm, to deal with the situation when the constraint is defined over problem parameters and show that it exhibits similar optimal rate of convergence to CSA. It is worth noting that CSA and CSPA are primal methods which do not require the iterations on the dual space and/or the estimation on the size of the dual variables. To the best of our knowledge, this is the first time that such optimal SA methods for solving functional or expectation constrained stochastic optimization are presented in the literature.

OCAug 29, 2015
Generalized Uniformly Optimal Methods for Nonlinear Programming

Saeed Ghadimi, Guanghui Lan, Hongchao Zhang

In this paper, we present a generic framework to extend existing uniformly optimal convex programming algorithms to solve more general nonlinear, possibly nonconvex, optimization problems. The basic idea is to incorporate a local search step (gradient descent or Quasi-Newton iteration) into these uniformly optimal convex programming methods, and then enforce a monotone decreasing property of the function values computed along the trajectory. Algorithms of these types will then achieve the best known complexity for nonconvex problems, and the optimal complexity for convex ones without requiring any problem parameters. As a consequence, we can have a unified treatment for a general class of nonlinear programming problems regardless of their convexity and smoothness level. In particular, we show that the accelerated gradient and level methods, both originally designed for solving convex optimization problems only, can be used for solving both convex and nonconvex problems uniformly. In a similar vein, we show that some well-studied techniques for nonlinear programming, e.g., Quasi-Newton iteration, can be embedded into optimal convex optimization algorithms to possibly further enhance their numerical performance. Our theoretical and algorithmic developments are complemented by some promising numerical results obtained for solving a few important nonconvex and nonlinear data analysis problems in the literature.

OCJul 8, 2015
An optimal randomized incremental gradient method

Guanghui Lan, Yi Zhou

In this paper, we consider a class of finite-sum convex optimization problems whose objective function is given by the summation of $m$ ($\ge 1$) smooth components together with some other relatively simple terms. We first introduce a deterministic primal-dual gradient (PDG) method that can achieve the optimal black-box iteration complexity for solving these composite optimization problems using a primal-dual termination criterion. Our major contribution is to develop a randomized primal-dual gradient (RPDG) method, which needs to compute the gradient of only one randomly selected smooth component at each iteration, but can possibly achieve better complexity than PDG in terms of the total number of gradient evaluations. More specifically, we show that the total number of gradient evaluations performed by RPDG can be ${\cal O} (\sqrt{m})$ times smaller, both in expectation and with high probability, than those performed by deterministic optimal first-order methods under favorable situations. We also show that the complexity of the RPDG method is not improvable by developing a new lower complexity bound for a general class of randomized methods for solving large-scale finite-sum convex optimization problems. Moreover, through the development of PDG and RPDG, we introduce a novel game-theoretic interpretation for these optimal methods for convex optimization.

OCJun 4, 2014
Gradient Sliding for Composite Optimization

Guanghui Lan

We consider in this paper a class of composite optimization problems whose objective function is given by the summation of a general smooth and nonsmooth component, together with a relatively simple nonsmooth term. We present a new class of first-order methods, namely the gradient sliding algorithms, which can skip the computation of the gradient for the smooth component from time to time. As a consequence, these algorithms require only ${\cal O}(1/\sqrtε)$ gradient evaluations for the smooth component in order to find an $ε$-solution for the composite problem, while still maintaining the optimal ${\cal O}(1/ε^2)$ bound on the total number of subgradient evaluations for the nonsmooth component. We then present a stochastic counterpart for these algorithms and establish similar complexity bounds for solving an important class of stochastic composite optimization problems. Moreover, if the smooth component in the composite function is strongly convex, the developed gradient sliding algorithms can significantly reduce the number of graduate and subgradient evaluations for the smooth and nonsmooth component to ${\cal O} (\log (1/ε))$ and ${\cal O}(1/ε)$, respectively. Finally, we generalize these algorithms to the case when the smooth component is replaced by a nonsmooth one possessing a certain bi-linear saddle point structure.