Niao He

LG
h-index: 17
74 papers
1,949 citations
Novelty: 60%
AI Score: 62

74 Papers

20.7 · LG · Oct 14, 2023 · Code
DPZero: Private Fine-Tuning of Language Models without Backpropagation

Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil et al. · eth-zurich

The widespread practice of fine-tuning large language models (LLMs) on domain-specific data faces two major challenges in memory and privacy. First, as the size of LLMs continues to grow, the memory demands of gradient-based training methods via backpropagation become prohibitively high. Second, given the tendency of LLMs to memorize training data, it is important to protect potentially sensitive information in the fine-tuning data from being regurgitated. Zeroth-order methods, which rely solely on forward passes, substantially reduce memory consumption during training. However, directly combining them with standard differentially private gradient descent suffers more as model size grows. To bridge this gap, we introduce DPZero, a novel private zeroth-order algorithm with nearly dimension-independent rates. The memory efficiency of DPZero is demonstrated in privately fine-tuning RoBERTa and OPT on several downstream tasks. Our code is available at https://github.com/Liang137/DPZero.
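
As a rough illustration of the zeroth-order idea mentioned in the abstract (gradient estimates built from forward evaluations only), here is a minimal two-point estimator in plain Python. It is a sketch under stated assumptions, not the DPZero algorithm: the privacy mechanism is omitted entirely, and the names `zo_gradient`, `quadratic`, `averaged_estimate`, and the smoothing parameter `mu` are hypothetical choices made for this example.

```python
import random


def zo_gradient(f, x, mu=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate of f at x.

    Only two evaluations of f are needed (two "forward passes");
    no derivative of f is ever computed.
    """
    # Random Gaussian direction with the same dimension as x.
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Scalar finite difference of f along the direction u.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    # The estimate points along u, scaled by the measured slope.
    return [slope * ui for ui in u]


def quadratic(x):
    """Toy objective (assumption for the demo); its true gradient is x."""
    return 0.5 * sum(xi * xi for xi in x)


def averaged_estimate(f, x, n_samples, seed=0):
    """Average many single-direction estimates to reduce variance."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        g = zo_gradient(f, x, rng=rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]
```

On the toy quadratic, averaging many such estimates approaches the true gradient `x`. A single estimate is noisy, and that noise grows with the dimension.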

18.1 · LG · Jun 1, 2022
Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization

Liang Zhang, Kiran Koshy Thekumparampil, Sewoong Oh et al. · eth-zurich

We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables practitioners to bring their own base optimization algorithm and use it as a black box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [22] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first near-linear time algorithms with near-optimal guarantees on the population duality gap for smooth DP-SMO, when the objective is (strongly-)convex-(strongly-)concave. Additionally, based on our flexible framework, we enrich the family of near-linear time algorithms for smooth DP-SCO with the near-optimal privacy-loss trade-off.

26.7 · LG · Feb 3, 2023
Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva et al. · eth-zurich

Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.

10.4 · LG · May 11 · Code
Muown: Row-Norm Control for Muon Optimization

Kai Lion, Florian Hübler, Bingcong Li et al.

Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic regimes under a dual norm aligned with the underlying geometries and with a stochastic noise coefficient that empirically remains below that of Muon throughout training. Across GPT-style pre-training on FineWeb-Edu with model sizes from 124M up to 2.7B parameters, Muown improves perplexity over Muon, SOAP, AdamW, and Lion. It also widens the plateau of near-optimal learning rates across model scales, reduces sensitivity to weight decay, and avoids the spectral norm drift at negligible step-time overhead when appropriately sharded.

15.5 · LG · Jun 2, 2023
Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space

Anas Barakat, Ilyas Fatkhullin, Niao He · eth-zurich

We consider the reinforcement learning (RL) problem with general utilities which consists in maximizing a function of the state-action occupancy measure. Beyond the standard cumulative reward RL setting, this problem includes as particular cases constrained RL, pure exploration and learning from demonstrations among others. For this problem, we propose a simpler single-loop parameter-free normalized policy gradient algorithm. Implementing a recursive momentum variance reduction mechanism, our algorithm achieves $\tilde{\mathcal{O}}(ε^{-3})$ and $\tilde{\mathcal{O}}(ε^{-2})$ sample complexities for $ε$-first-order stationarity and $ε$-global optimality respectively, under adequate assumptions. We further address the setting of large finite state action spaces via linear function approximation of the occupancy measure and show a $\tilde{\mathcal{O}}(ε^{-4})$ sample complexity for a simple policy gradient method with a linear regression subroutine.

16.5 · LG · Jun 2, 2022
Finite-Time Analysis of Entropy-Regularized Neural Natural Actor-Critic Algorithm

Semih Cayci, Niao He, R. Srikant

Natural actor-critic (NAC) and its variants, equipped with the representation power of neural networks, have demonstrated impressive empirical success in solving Markov decision problems with large state spaces. In this paper, we present a finite-time analysis of NAC with neural network approximation, and identify the roles of neural networks, regularization and optimization techniques (e.g., gradient clipping and averaging) to achieve provably good performance in terms of sample complexity, iteration complexity and overparametrization bounds for the actor and the critic. In particular, we prove that (i) entropy regularization and averaging ensure stability by providing sufficient exploration to avoid near-deterministic and strictly suboptimal policies and (ii) regularization leads to sharp sample complexity and network width bounds in the regularized MDPs, yielding a favorable bias-variance tradeoff in policy optimization. In the process, we identify the importance of uniform approximation power of the actor neural network to achieve global optimality in policy optimization due to distributional shift.

12.1 · OC · May 28, 2022
Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization

Siqi Zhang, Yifan Hu, Liang Zhang et al. · eth-zurich

This paper takes an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $ε$-generalization are $\tilde{\mathcal{O}}(dκ^2ε^{-2})$ and $\tilde{\mathcal{O}}(dε^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension and $κ$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms.

17.7 · OC · Jun 1, 2022
Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization

Junchi Yang, Xiang Li, Niao He

Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(ε^{-2})$ and $\tilde{O}(ε^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
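
The nested structure described above (an inner loop that maximizes the dual variable until a stopping criterion is met, and an outer loop that takes a primal step) can be sketched as follows. This is only a skeleton under loud assumptions: it uses fixed stepsizes rather than the adaptive optimizers NeAda is built around, and the toy objective, the function names, and all default values are hypothetical choices for this example.

```python
def grad_x(x, y):
    """Partial derivative in x of the toy objective
    f(x, y) = 0.5*x**2 + 2*x*y - 0.5*y**2 (assumption for the demo)."""
    return x + 2.0 * y


def grad_y(x, y):
    """Partial derivative in y of the same toy objective."""
    return 2.0 * x - y


def nested_gda(x0, y0, eta_x=0.1, eta_y=0.5,
               outer_steps=60, inner_tol=1e-8, max_inner=1000):
    """Nested descent-ascent with a gradient-norm stopping rule inside.

    Fixed stepsizes are used here purely for illustration; they are
    NOT the adaptive stepsizes discussed in the abstract.
    """
    x, y = x0, y0
    for _ in range(outer_steps):
        # Inner loop: ascend in y until the dual gradient is small.
        for _ in range(max_inner):
            gy = grad_y(x, y)
            if abs(gy) < inner_tol:
                break
            y += eta_y * gy
        # Outer loop: one descent step in x using the refreshed y.
        x -= eta_x * grad_x(x, y)
    return x, y
```

For this toy objective the unique saddle point is `(0, 0)`, and `nested_gda(1.0, -1.0)` approaches it. The inner tolerance controls how closely `y` tracks the best response before each primal step.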

15.6 · LG · May 25, 2022
Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function

Saeed Masiha, Saber Salehkaleybar, Niao He et al.

We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying the gradient dominance property with $1\le\alpha\le2$, which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving an $\varepsilon$-global optimum is $\mathcal{O}(\varepsilon^{-7/(2\alpha)+1})$ for $1\le\alpha< 3/2$ and $\tilde{\mathcal{O}}(\varepsilon^{-2/\alpha})$ for $3/2\le\alpha\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of the gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to $\mathcal{O}(\varepsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.

18.7 · OC · Dec 29, 2022
Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games

Batuhan Yardim, Semih Cayci, Matthieu Geist et al.

Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games. However, limiting applicability, existing theoretical results assume variations of a "population generative model", which allows arbitrary modifications of the population distribution by the learning algorithm. Moreover, learning algorithms typically work on abstract simulators with population instead of the $N$-player game. Instead, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Taking a divergent approach from the literature, instead of working with the best-response map we first show that a policy mirror ascent map can be used to construct a contractive operator having the Nash equilibrium as its fixed point. We analyze single-path TD learning for $N$-agent games, proving sample complexity guarantees by only using a sample path from the $N$-agent simulator without a population generative model. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite sample guarantees.

12.4 · LG · May 17, 2022
Momentum-Based Policy Gradient with Second-Order Information

Saber Salehkaleybar, Sadegh Khorasani, Negar Kiyavash et al.

Variance-reduced gradient estimators for policy gradient methods have been one of the main focuses of research in reinforcement learning in recent years, as they allow acceleration of the estimation process. We propose a variance-reduced policy-gradient method, called SHARP, which incorporates second-order information into stochastic gradient descent (SGD) using momentum with a time-varying learning rate. The SHARP algorithm is parameter-free, achieving an $ε$-approximate first-order stationary point with $O(ε^{-3})$ trajectories, while using a batch size of $O(1)$ at each iteration. Unlike most previous work, our proposed algorithm does not require importance sampling, which can compromise the advantage of the variance reduction process. Moreover, the variance of the estimation error decays at the fast rate of $O(1/t^{2/3})$, where $t$ is the number of iterations. Our extensive experimental evaluations show the effectiveness of the proposed algorithm on various control tasks and its advantage over the state of the art in practice.

11.2 · OC · Oct 31, 2022 · Code
TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization

Xiang Li, Junchi Yang, Niao He

Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.

11.5 · LG · Jun 26, 2023
On Imitation in Mean-field Games

Giorgia Ramponi, Pavel Kolev, Olivier Pietquin et al.

We explore the problem of imitation learning (IL) in the context of mean-field games (MFGs), where the goal is to imitate the behavior of a population of agents following a Nash equilibrium policy according to some unknown payoff function. IL in MFGs presents new challenges compared to single-agent IL, particularly when both the reward function and the transition kernel depend on the population distribution. In this paper, departing from the existing literature on IL for MFGs, we introduce a new solution concept called the Nash imitation gap. Then we show that when only the reward depends on the population distribution, IL in MFGs can be reduced to single-agent IL with similar guarantees. However, when the dynamics is population-dependent, we provide a novel upper-bound that suggests IL is harder in this setting. To address this issue, we propose a new adversarial formulation where the reinforcement learning problem is replaced by a mean-field control (MFC) problem, suggesting progress in IL within MFGs may have to build upon MFC.

8.9 · ML · Nov 14, 2022
Learning to Optimize with Stochastic Dominance Constraints

Hanjun Dai, Yuan Xue, Niao He et al.

In real-world decision-making, uncertainty is important yet difficult to handle. Stochastic dominance provides a theoretically sound approach for comparing uncertain quantities, but optimization with stochastic dominance constraints is often computationally expensive, which limits practical applicability. In this paper, we develop a simple yet efficient approach for the problem, the Light Stochastic Dominance Solver (light-SD), that leverages useful properties of the Lagrangian. We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation, which bypasses apparent intractability and leads to tractable updates or even closed-form solutions for gradient calculations. We prove convergence of the algorithm and test it empirically. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.

9.8 · LG · Jun 13, 2023
Provably Learning Nash Policies in Constrained Markov Potential Games

Pragnya Alatur, Giorgia Ramponi, Niao He et al.

Multi-agent reinforcement learning (MARL) addresses sequential decision-making problems with multiple agents, where each agent optimizes its own objective. In many real-world instances, the agents may not only want to optimize their objectives, but also ensure safe behavior. For example, in traffic routing, each car (agent) aims to reach its destination quickly (objective) while avoiding collisions (safety). Constrained Markov Games (CMGs) are a natural formalism for safe MARL problems, though generally intractable. In this work, we introduce and study Constrained Markov Potential Games (CMPGs), an important class of CMGs. We first show that a Nash policy for CMPGs can be found via constrained optimization. One tempting approach is to solve it by Lagrangian-based primal-dual methods. As we show, however, in contrast to the single-agent setting, CMPGs do not satisfy strong duality, rendering such approaches inapplicable and potentially unsafe. To solve the CMPG problem, we propose our algorithm Coordinate-Ascent for CMPGs (CA-CMPG), which provably converges to a Nash policy in tabular, finite-horizon CMPGs. Furthermore, we provide the first sample complexity bounds for learning Nash policies in unknown CMPGs, which, under additional assumptions, guarantee safe exploration.

5.6 · OC · Aug 3, 2024
Optimal Local Convergence Rates of Stochastic First-Order Methods under Local $α$-PL

Saeed Masiha, Saber Salehkaleybar, Niao He et al.

We study the local convergence rate of stochastic first-order methods under a local $\alpha$-Polyak-Lojasiewicz ($\alpha$-PL) condition in a neighborhood of a target connected component $\mathcal{M}$ of the local minimizer set. The parameter $\alpha\in [1,2]$ is the exponent of the gradient norm in the $\alpha$-PL inequality: $\alpha=2$ recovers the classical PL case, $\alpha=1$ corresponds to Hölder-type error bounds, and intermediate values interpolate between these regimes. Our performance criterion is the number of oracle queries required to output $\hat{x}$ with $F(\hat{x})-l \le \varepsilon$, where $l := F(y)$ for any $y \in \mathcal{M}$. We work in a local regime where the algorithm is initialized near $\mathcal{M}$ and, with high probability, its iterates remain in that neighborhood. We establish a lower bound $\Omega(\varepsilon^{-2/\alpha})$ for all stochastic first-order methods in this regime, and we obtain a matching upper bound $\mathcal{O}(\varepsilon^{-2/\alpha})$ for $1 \le \alpha < 2$ via a SARAH-type variance-reduced method with time-varying batch sizes and step sizes. In the convex setting, assuming a local $\alpha$-PL condition on the $\varepsilon$-sublevel set, we further show a complexity lower bound $\widetilde{\Omega}(\varepsilon^{-2/\alpha})$ for reaching an $\varepsilon$-global optimum, matching the $\varepsilon$-dependence of known accelerated stochastic subgradient methods.

5.1 · SY · Sep 8, 2023 · Code
Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity and Last-Iterate Convergence

Jiduan Wu, Anas Barakat, Ilyas Fatkhullin et al. · eth-zurich

Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control and (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games entail solving a challenging nonconvex-nonconcave min-max problem with an objective function that lacks coercivity. Recently, Zhang et al. showed that an $ε$-Nash equilibrium (NE) of finite horizon zero-sum LQ games can be learned via nested model-free Natural Policy Gradient (NPG) algorithms with poly$(1/ε)$ sample complexity. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm improving sample complexity by several orders of magnitude and guaranteeing convergence of the last iterate. Our main results are two-fold: (i) in the deterministic setting, we establish the first global last-iterate linear convergence result for the nested algorithm that seeks NE of zero-sum LQ games; (ii) in the model-free setting, we establish a $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity using a single-point ZO estimator. For our last-iterate convergence results, our analysis leverages the Implicit Regularization (IR) property and a new gradient domination condition for the primal function. Our key improvements in the sample complexity rely on a more sample-efficient nested algorithm design and a finer control of the ZO natural gradient estimation error utilizing the structure endowed by the finite-horizon setting.

10.7 · LG · May 13
Support Before Frequency in Discrete Diffusion

Adrian Müller, Antoine Gonon, Zebang Shen et al.

Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The separation is mechanism-dependent: uniform diffusion exhibits a trichotomy into validity-improving, validity-preserving, and validity-worsening edits, while absorbing diffusion places its leading-order mass on validity-improving moves. Experiments on a masked language diffusion model and synthetic regular-language tasks support these predictions: support-localization emerges earlier than within-support frequency ranking, and the contrast between uniform and absorbing diffusion matches the predicted rate separation. Together, our results suggest that discrete diffusion models learn data support before data frequencies.

9.8 · LG · Jun 12, 2023
Cancellation-Free Regret Bounds for Lagrangian Approaches in Constrained Markov Decision Processes

Adrian Müller, Pragnya Alatur, Giorgia Ramponi et al.

Constrained Markov Decision Processes (CMDPs) are one of the common ways to model safe reinforcement learning problems, where constraint functions model the safety objectives. Lagrangian-based dual or primal-dual algorithms provide efficient methods for learning in CMDPs. For these algorithms, the currently known regret bounds in the finite-horizon setting allow for a "cancellation of errors"; one can compensate for a constraint violation in one episode with a strict constraint satisfaction in another. However, we do not consider such a behavior safe in practical applications. In this paper, we overcome this weakness by proposing a novel model-based dual algorithm OptAug-CMDP for tabular finite-horizon CMDPs. Our algorithm is motivated by the augmented Lagrangian method and can be performed efficiently. We show that during $K$ episodes of exploring the CMDP, our algorithm obtains a regret of $\tilde{O}(\sqrt{K})$ for both the objective and the constraint violation. Unlike existing Lagrangian approaches, our algorithm achieves this regret without the need for the cancellation of errors.

9.8 · LG · Jun 25, 2023
Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Jun Song, Niao He, Lijun Ding et al.

Trust-region methods based on Kullback-Leibler divergence are pervasively used to stabilize policy optimization in reinforcement learning. In this paper, we exploit more flexible metrics and examine two natural extensions of policy optimization with Wasserstein and Sinkhorn trust regions, namely Wasserstein policy optimization (WPO) and Sinkhorn policy optimization (SPO). Instead of restricting the policy to a parametric distribution class, we directly optimize the policy distribution and derive closed-form policy updates based on Lagrangian duality. Theoretically, we show that WPO guarantees a monotonic performance improvement, and SPO provably converges to WPO as the entropic regularizer diminishes. Moreover, we prove that with a decaying Lagrangian multiplier to the trust region constraint, both methods converge to global optimality. Experiments across tabular domains, robotic locomotion, and continuous control tasks further demonstrate the performance improvement of both approaches, greater robustness of WPO to sample insufficiency, and faster convergence of SPO, compared with state-of-the-art policy gradient methods.

5.3 · LG · Oct 26, 2023
Optimal Guarantees for Algorithmic Reproducibility and Gradient Complexity in Convex Optimization

Liang Zhang, Junchi Yang, Amin Karbasi et al. · eth-zurich

Algorithmic reproducibility measures the deviation in outputs of machine learning algorithms upon minor changes in the training process. Previous work suggests that first-order methods would need to trade off convergence rate (gradient complexity) for better reproducibility. In this work, we challenge this perception and demonstrate that both optimal reproducibility and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems under various error-prone oracle settings. In particular, given the inexact initialization oracle, our regularization-based algorithms achieve the best of both worlds - optimal reproducibility and near-optimal gradient complexity - for minimization and minimax optimization. With the inexact gradient oracle, the near-optimal guarantees also hold for minimax optimization. Additionally, with the stochastic gradient oracle, we show that stochastic gradient descent ascent is optimal in terms of both reproducibility and gradient complexity. We believe our results contribute to an enhanced understanding of the reproducibility-convergence trade-off in the context of convex optimization.

5.9 · ML · Feb 26, 2023 · Code
Kernel Conditional Moment Constraints for Confounding Robust Inference

Kei Ishikawa, Niao He

We study policy evaluation of offline contextual bandits subject to unobserved confounders. Sensitivity analysis methods are commonly used to estimate the policy value under the worst-case confounding over a given uncertainty set. However, existing work often resorts to some coarse relaxation of the uncertainty set for the sake of tractability, leading to overly conservative estimation of the policy value. In this paper, we propose a general estimator that provides a sharp lower bound of the policy value. It can be shown that our estimator contains the recently proposed sharp estimator by Dorn and Guo (2022) as a special case, and our method enables a novel extension of the classical marginal sensitivity model using f-divergence. To construct our estimator, we leverage the kernel method to obtain a tractable approximation to the conditional moment constraints, which traditional non-sharp estimators failed to take into account. In the theoretical analysis, we provide a condition for the choice of the kernel which guarantees no specification error that biases the lower bound estimation. Furthermore, we provide consistency guarantees of policy evaluation and learning. In the experiments with synthetic and real-world data, we demonstrate the effectiveness of the proposed method.

3.8 · LG · Nov 15, 2023
Efficiently Escaping Saddle Points for Policy Optimization

Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash et al.

Policy gradient (PG) is widely used in reinforcement learning due to its scalability and good performance. In recent years, several variance-reduced PG methods have been proposed with a theoretical guarantee of converging to an approximate first-order stationary point (FOSP) with the sample complexity of $O(ε^{-3})$. However, FOSPs could be bad local optima or saddle points. Moreover, these algorithms often use importance sampling (IS) weights which could impair the statistical effectiveness of variance reduction. In this paper, we propose a variance-reduced second-order method that uses second-order information in the form of Hessian vector products (HVP) and converges to an approximate second-order stationary point (SOSP) with sample complexity of $\tilde{O}(ε^{-3})$. This rate improves the best-known sample complexity for achieving approximate SOSPs by a factor of $O(ε^{-0.5})$. Moreover, the proposed variance reduction technique bypasses IS weights by using HVP terms. Our experimental results show that the proposed algorithm outperforms the state of the art and is more robust to changes in random seeds.

22.9OCNov 6, 2023
Parameter-Agnostic Optimization under Relaxed Smoothness

Florian Hübler, Junchi Yang, Xiang Li et al.

Tuning hyperparameters, such as the stepsize, is a major challenge in training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is unavoidable for such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.
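To make the NSGD-M update concrete, here is a minimal sketch of normalized SGD with momentum on a toy quadratic. The decaying schedules for the stepsize and momentum, the Gaussian noise model, and the test objective are illustrative assumptions, not necessarily the choices analyzed in the paper. The point is that no smoothness constant or noise level enters the update.

```python
import math
import random


def nsgd_m(grad, x0, steps=2000, seed=0, noise=0.1):
    """Normalized SGD with momentum on a noisy gradient oracle (illustrative)."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for t in range(1, steps + 1):
        # Noisy gradient: true gradient plus Gaussian noise (assumed noise model).
        g = [gi + noise * rng.gauss(0.0, 1.0) for gi in grad(x)]
        # Assumed schedules that use no problem-specific constants.
        beta = 1.0 - t ** -0.5
        eta = t ** -0.75
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm > 0.0:
            # The step length equals eta regardless of the gradient magnitude.
            x = [xi - eta * mi / norm for xi, mi in zip(x, m)]
    return x


def quad_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [1.0 * x[0], 10.0 * x[1]]


x_final = nsgd_m(quad_grad, [5.0, -3.0])
```

Because the momentum vector is normalized before the step, a large gradient cannot produce a large move, which is the property that makes the method attractive when the smoothness constants are unknown.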

3.8LGFeb 10, 2023Code
Robust Knowledge Transfer in Tiered Reinforcement Learning

Jiawei Huang, Niao He

In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge of the task similarity. We identify a natural and necessary condition called "Optimal Value Dominance" for our objective. Under this condition, we propose novel online learning algorithms that, for the high-tier task, achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, they remain near-optimal without sacrifice. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.

7.9LGJul 14, 2024Code
Learning to Steer Markovian Agents under Model Uncertainty

Jiawei Huang, Vinzenz Thoma, Zebang Shen et al.

Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies without prior knowledge of the agents' underlying learning dynamics. Motivated by the limitation of existing works, we consider a new and general category of learning dynamics called Markovian agents. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our steering problem. Importantly, we focus on learning a history-dependent steering strategy to handle the inherent model uncertainty about the agents' learning dynamics. We introduce a novel objective function to encode the desiderata of achieving a good steering outcome with reasonable cost. Theoretically, we identify conditions for the existence of steering strategies to guide agents to the desired policies. Complementing our theoretical contributions, we provide empirical algorithms to approximately solve our objective, which effectively tackles the challenge in learning history-dependent strategies. We demonstrate the efficacy of our algorithms through empirical evaluations.

3.3GTAug 27, 2024
Exploiting Approximate Symmetry for Efficient Multi-Agent Reinforcement Learning

Batuhan Yardim, Niao He

Mean-field games (MFG) have become significant tools for solving large-scale multi-agent reinforcement learning problems under symmetry. However, the assumption of exact symmetry limits the applicability of MFGs, as real-world scenarios often feature inherent heterogeneity. Furthermore, most works on MFG assume access to a known MFG model, which might not be readily available for real-world finite-agent games. In this work, we broaden the applicability of MFGs by providing a methodology to extend any finite-player, possibly asymmetric, game to an "induced MFG". First, we prove that $N$-player dynamic games can be symmetrized and smoothly extended to the infinite-player continuum via explicit Kirszbraun extensions. Next, we propose the notion of $α,β$-symmetric games, a new class of dynamic population games that incorporate approximate permutation invariance. For $α,β$-symmetric games, we establish explicit approximation bounds, demonstrating that a Nash policy of the induced MFG is an approximate Nash of the $N$-player dynamic game. We show that TD learning converges up to a small bias using trajectories of the $N$-player game with finite-sample guarantees, permitting symmetrized learning without building an explicit MFG model. Finally, for certain games satisfying monotonicity, we prove a sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-6})$ for the $N$-agent game to learn an $\varepsilon$-Nash up to symmetrization bias. Our theory is supported by evaluations on MARL benchmarks with thousands of agents.

11.5LGAug 15, 2024
Independent Policy Mirror Descent for Markov Potential Games: Scaling to Large Number of Players

Pragnya Alatur, Anas Barakat, Niao He

Markov Potential Games (MPGs) form an important sub-class of Markov games, which are a common framework to model multi-agent reinforcement learning problems. In particular, MPGs include as a special case the identical-interest setting where all the agents share the same reward function. Scaling the performance of Nash equilibrium learning algorithms to a large number of agents is crucial for multi-agent systems. To address this important challenge, we focus on the independent learning setting where agents can only have access to their local information to update their own policy. In prior work on MPGs, the iteration complexity for obtaining $ε$-Nash regret scales linearly with the number of agents $N$. In this work, we investigate the iteration complexity of an independent policy mirror descent (PMD) algorithm for MPGs. We show that PMD with KL regularization, also known as natural policy gradient, enjoys a better $\sqrt{N}$ dependence on the number of agents, improving over PMD with Euclidean regularization and prior work. Furthermore, the iteration complexity is also independent of the sizes of the agents' action spaces.
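The KL-regularized mirror descent step mentioned in the abstract has a closed form on the simplex: the new policy is proportional to the old policy times the exponentiated action values. The sketch below applies that step independently for two agents in a stateless identical-interest game. The reward matrix, stepsize, and iteration count are hypothetical, and a single-state game is a strong simplification of the Markov setting studied in the paper.

```python
import math


def kl_pmd_step(policy, q_values, eta):
    """One KL-regularized mirror descent step on the simplex.

    The new probability of action a is proportional to
    policy[a] * exp(eta * q_values[a]).
    """
    # Subtract the max for numerical stability; it cancels after normalization.
    q_max = max(q_values)
    weights = [p * math.exp(eta * (q - q_max)) for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 2x2 identical-interest game: both agents receive R[a1][a2].
R = [[1.0, 0.0],
     [0.0, 2.0]]
pi1 = [0.5, 0.5]
pi2 = [0.5, 0.5]
for _ in range(200):
    # Each agent evaluates its own actions against the other's current policy.
    q1 = [sum(R[a][b] * pi2[b] for b in range(2)) for a in range(2)]
    q2 = [sum(R[a][b] * pi1[a] for a in range(2)) for b in range(2)]
    # Simultaneous independent updates: neither agent sees the other's update rule.
    pi1, pi2 = kl_pmd_step(pi1, q1, 0.5), kl_pmd_step(pi2, q2, 0.5)
```

Starting from uniform policies, both agents concentrate on the second action, which is the higher-reward pure equilibrium of this toy game.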

12.2OCAug 20, 2024
Multi-level Monte-Carlo Gradient Methods for Stochastic Optimization with Biased Oracles

Yifan Hu, Jie Wang, Xin Chen et al.

We consider stochastic optimization when one only has access to biased stochastic oracles of the objective and the gradient, and obtaining stochastic gradients with low biases comes at high costs. This setting captures various optimization paradigms, such as conditional stochastic optimization, distributionally robust optimization, shortfall risk optimization, and machine learning paradigms, such as contrastive learning. We examine a family of multi-level Monte Carlo (MLMC) gradient methods that exploit a delicate tradeoff among bias, variance, and oracle cost. We systematically study their total sample and computational complexities for strongly convex, convex, and nonconvex objectives and demonstrate their superiority over the widely used biased stochastic gradient method. When combined with the variance reduction techniques like SPIDER, these MLMC gradient methods can further reduce the complexity in the nonconvex regime. Our results imply that a series of stochastic optimization problems with biased oracles, previously considered to be more challenging, is fundamentally no harder than the classical stochastic optimization with unbiased oracles. We also delineate the boundary conditions under which these problems become more difficult. Moreover, MLMC gradient methods significantly improve the best-known complexities in the literature for conditional stochastic optimization and shortfall risk optimization. Our extensive numerical experiments on distributionally robust optimization, pricing and staffing scheduling problems, and contrastive learning demonstrate the superior performance of MLMC gradient methods.

9.4OCNov 13, 2025
Global Solutions to Non-Convex Functional Constrained Problems with Hidden Convexity

Ilyas Fatkhullin, Niao He, Guanghui Lan et al.

Constrained non-convex optimization is fundamentally challenging, as global solutions are generally intractable and constraint qualifications may not hold. However, in many applications, including safe policy optimization in control and reinforcement learning, such problems possess hidden convexity, meaning they can be reformulated as convex programs via a nonlinear invertible transformation. Typically such transformations are implicit or unknown, making the direct link with the convex program impossible. On the other hand, (sub-)gradients with respect to the original variables are often accessible or can be easily estimated, which motivates algorithms that operate directly in the original (non-convex) problem space using standard (sub-)gradient oracles. In this work, we develop the first algorithms to provably solve such non-convex problems to global minima. First, using a modified inexact proximal point method, we establish global last-iterate convergence guarantees with $\widetilde{\mathcal{O}}(\varepsilon^{-3})$ oracle complexity in the non-smooth setting. For smooth problems, we propose a new bundle-level type method based on linearly constrained quadratic subproblems, improving the oracle complexity to $\widetilde{\mathcal{O}}(\varepsilon^{-1})$. Surprisingly, despite non-convexity, our methodology does not require any constraint qualifications, can handle hidden convex equality constraints, and achieves complexities matching those for solving unconstrained hidden convex optimization.

20.7LGFeb 24, 2024
Truly No-Regret Learning in Constrained MDPs

Adrian Müller, Pragnya Alatur, Volkan Cevher et al.

Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations -- one can compensate for a constraint violation in one round with a strict constraint satisfaction in another. This makes the online learning process unsafe since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.

15.3OCFeb 27, 2024
Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence

Ilyas Fatkhullin, Niao He · eth-zurich

This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the contemporary nonconvex optimization setting. Existing results for batch-free nonconvex SMD restrict the choice of the distance generating function (DGF) to be differentiable with Lipschitz continuous gradients, thereby excluding important setups such as Shannon entropy. In this work, we present a new convergence analysis of nonconvex SMD supporting general DGF, that overcomes the above limitations and relies solely on the standard assumptions. Moreover, our convergence is established with respect to the Bregman Forward-Backward envelope, which is a stronger measure than the commonly used squared norm of gradient mapping. We further extend our results to guarantee high probability convergence under sub-Gaussian noise and global convergence under the generalized Bregman Proximal Polyak-Łojasiewicz condition. Additionally, we illustrate the advantages of our improved SMD theory in various nonconvex machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of nonconvex differentially private (DP) learning, our theory yields a simple algorithm with a (nearly) dimension-independent utility bound. For the problem of training linear neural networks, we develop provably convergent stochastic algorithms.

15.0LGOct 18, 2024
Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems

Bingcong Li, Liang Zhang, Niao He · eth-zurich

Sharpness-aware minimization (SAM) improves generalization of various deep learning tasks. Motivated by popular architectures such as LoRA, we explore the implicit regularization of SAM for scale-invariant problems involving two groups of variables. Instead of focusing on commonly used sharpness, this work introduces a concept termed balancedness, defined as the difference between the squared norm of two variables. This allows us to depict richer global behaviors of SAM. In particular, our theoretical and empirical findings reveal that i) SAM promotes balancedness; and ii) the regularization on balancedness is data-responsive -- outliers have stronger impact. The latter coincides with empirical observations that SAM outperforms SGD in the presence of outliers. Leveraging the implicit regularization, we develop a resource-efficient SAM variant, balancedness-aware regularization (BAR), tailored for scale-invariant problems such as finetuning language models with LoRA. BAR saves 95% computational overhead of SAM, with enhanced test performance across various tasks on RoBERTa, GPT2, and OPT-1.3B.
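A small sketch may help show why balancedness is informative for scale-invariant problems. Rescaling the two variable groups in opposite directions leaves the loss unchanged but changes the difference of their squared norms. The toy loss, the sign convention (first group minus second), and the numbers below are assumptions for illustration.

```python
def sq_norm(v):
    return sum(vi * vi for vi in v)


def balancedness(x, y):
    """Difference of the squared norms of the two variable groups."""
    return sq_norm(x) - sq_norm(y)


def loss(x, y, target=2.0):
    # Toy scale-invariant objective: depends on (x, y) only through <x, y>.
    inner = sum(xi * yi for xi, yi in zip(x, y))
    return 0.5 * (inner - target) ** 2


x, y = [1.0, 2.0], [0.5, 0.25]
alpha = 4.0
# Rescaling (x, y) -> (alpha * x, y / alpha) keeps <x, y> and hence the loss fixed.
x_scaled = [alpha * xi for xi in x]
y_scaled = [yi / alpha for yi in y]

same_loss = loss(x, y) == loss(x_scaled, y_scaled)
b_before = balancedness(x, y)
b_after = balancedness(x_scaled, y_scaled)
```

The loss cannot distinguish the two parameterizations, so any preference between them must come from the optimizer, which is the kind of implicit regularization the abstract describes.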

12.5LGFeb 8, 2024Code
Model-Based RL for Mean-Field Games is not Statistically Harder than Single-Agent RL

Jiawei Huang, Niao He, Andreas Krause

We study the sample complexity of reinforcement learning (RL) in Mean-Field Games (MFGs) with model-based function approximation that requires strategic exploration to find a Nash Equilibrium policy. We introduce the Partial Model-Based Eluder Dimension (P-MBED), a more effective notion to characterize the model class complexity. Notably, P-MBED measures the complexity of the single-agent model class converted from the given mean-field model class, and potentially, can be exponentially lower than the MBED proposed by Huang et al. (2023). We contribute a model elimination algorithm featuring a novel exploration strategy and establish sample complexity results polynomial w.r.t. P-MBED. Crucially, our results reveal that, under the basic realizability and Lipschitz continuity assumptions, learning Nash Equilibrium in MFGs is no more statistically challenging than solving a logarithmic number of single-agent RL problems. We further extend our results to Multi-Type MFGs, generalizing from conventional MFGs and involving multiple types of agents. This extension implies statistical tractability of a broader class of Markov Games through the efficacy of mean-field approximation. Finally, inspired by our theoretical algorithm, we present a heuristic approach with improved computational efficiency and empirically demonstrate its effectiveness.

21.3LGJun 18, 2025
Provable Maximum Entropy Manifold Exploration via Diffusion Models

Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh et al.

Exploration is critical for solving real-world decision-making problems such as scientific discovery, where the objective is to generate truly novel designs rather than mimic existing data distributions. In this work, we address the challenge of leveraging the representational power of generative models for exploration without relying on explicit uncertainty quantification. We introduce a novel framework that casts exploration as entropy maximization over the approximate data manifold implicitly defined by a pre-trained diffusion model. Then, we present a novel principle for exploration based on density estimation, a problem well-known to be challenging in practice. To overcome this issue and render this method truly scalable, we leverage a fundamental connection between the entropy of the density induced by a diffusion model and its score function. Building on this, we develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model. We prove its convergence to the optimal exploratory diffusion model under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we empirically evaluate our approach on both synthetic and high-dimensional text-to-image diffusion, demonstrating promising results.

14.4LGJun 5, 2025
Zeroth-Order Optimization Finds Flat Minima

Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil et al. · eth-zurich

Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known on the implicit regularization that provides a fine-grained characterization on which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support our theoretical findings.
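The two-point estimator referred to above builds a gradient estimate from two function evaluations along a random direction. The sketch below uses Gaussian directions and a symmetric difference, which is one common form; the abstract does not specify these details, so treat them, along with the toy quadratic and the sample count, as assumptions. It only checks that the estimates average to the true gradient and does not attempt to reproduce the flat-minima result.

```python
import random


def two_point_estimate(f, x, lam, rng):
    """Single two-point zeroth-order gradient estimate from two evaluations of f."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + lam * ui for xi, ui in zip(x, u)]
    x_minus = [xi - lam * ui for xi, ui in zip(x, u)]
    # Directional derivative estimate along u, then projected back onto u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * lam)
    return [scale * ui for ui in u]


def f(x):
    # Toy quadratic with per-coordinate curvatures (1, 2, 0.5).
    c = (1.0, 2.0, 0.5)
    return 0.5 * sum(ci * xi * xi for ci, xi in zip(c, x))


rng = random.Random(0)
x = [1.0, -1.0, 2.0]
n = 20000
avg = [0.0, 0.0, 0.0]
for _ in range(n):
    g = two_point_estimate(f, x, 1e-3, rng)
    avg = [a + gi / n for a, gi in zip(avg, g)]

# Exact gradient of f at x, for comparison with the averaged estimate.
true_grad = [1.0, -2.0, 1.0]
```

A single estimate is very noisy, which is why the comparison is made against an average over many random directions.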

4.1LGMar 12, 2025
Steering No-Regret Agents in MFGs under Model Uncertainty

Leo Widmer, Jiawei Huang, Niao He

Incentive design is a popular framework for guiding agents' learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steering them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents' behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents' behaviors in large-population systems under uncertainty.

17.9LGNov 27, 2025
Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning

Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh et al.

Adapting large-scale foundation flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.

2.3GTOct 24, 2025
Scalable Neural Incentive Design with Parameterized Mean-Field Approximation

Nathan Corecco, Batuhan Yardim, Vinzenz Thoma et al.

Designing incentives for a multi-agent system to induce a desirable Nash equilibrium is both a crucial and challenging problem appearing in many decision-making domains, especially for a large number of agents $N$. Under the exchangeability assumption, we formalize this incentive design (ID) problem as a parameterized mean-field game (PMFG), aiming to reduce complexity via an infinite-population limit. We first show that when dynamics and rewards are Lipschitz, the finite-$N$ ID objective is approximated by the PMFG at rate $\mathscr{O}(\frac{1}{\sqrt{N}})$. Moreover, beyond the Lipschitz-continuous setting, we prove the same $\mathscr{O}(\frac{1}{\sqrt{N}})$ decay for the important special case of sequential auctions, despite discontinuities in dynamics, through a tailored auction-specific analysis. Built on our novel approximation results, we further introduce our Adjoint Mean-Field Incentive Design (AMID) algorithm, which uses explicit differentiation of iterated equilibrium operators to compute gradients efficiently. By uniting approximation bounds with optimization guarantees, AMID delivers a powerful, scalable algorithmic tool for many-agent (large $N$) ID. Across diverse auction settings, the proposed AMID method substantially increases revenue over first-price formats and outperforms existing benchmark methods.

9.4LGOct 22, 2025
Natural Gradient VI: Guarantees for Non-Conjugate Models

Fangyuan Sun, Ilyas Fatkhullin, Niao He

Stochastic Natural Gradient Variational Inference (NGVI) is a widely used method for approximating posterior distribution in probabilistic models. Despite its empirical success and foundational role in variational inference, its theoretical underpinnings remain limited, particularly in the case of non-conjugate likelihoods. While NGVI has been shown to be a special instance of Stochastic Mirror Descent, and recent work has provided convergence guarantees using relative smoothness and strong convexity for conjugate models, these results do not extend to the non-conjugate setting, where the variational loss becomes non-convex and harder to analyze. In this work, we focus on mean-field parameterization and advance the theoretical understanding of NGVI in three key directions. First, we derive sufficient conditions under which the variational loss satisfies relative smoothness with respect to a suitable mirror map. Second, leveraging this structure, we propose a modified NGVI algorithm incorporating non-Euclidean projections and prove its global non-asymptotic convergence to a stationary point. Finally, under additional structural assumptions about the likelihood, we uncover hidden convexity properties of the variational loss and establish fast global convergence of NGVI to a global optimum. These results provide new insights into the geometry and convergence behavior of NGVI in challenging inference settings.

13.0LGSep 27, 2025
Landing with the Score: Riemannian Optimization through Denoising

Andrey Kharitenko, Zebang Shen, Riccardo de Santi et al.

Under the data manifold hypothesis, high-dimensional data are concentrated near a low-dimensional manifold. We study the problem of Riemannian optimization over such manifolds when they are given only implicitly through the data distribution, and the standard manifold operations required by classical algorithms are unavailable. This formulation captures a broad class of data-driven design problems that are central to modern generative AI. Our key idea is to introduce a link function that connects the data distribution to the geometric operations needed for optimization. We show that this function enables the recovery of essential manifold operations, such as retraction and Riemannian gradient computation. Moreover, we establish a direct connection between our construction and the score function in diffusion models of the data distribution. This connection allows us to leverage well-studied parameterizations, efficient training procedures, and even pretrained score networks from the diffusion model literature to perform optimization. Building on this foundation, we propose two efficient inference-time algorithms -- Denoising Landing Flow (DLF) and Denoising Riemannian Gradient Descent (DRGD) -- and provide theoretical guarantees for both feasibility (approximate manifold adherence) and optimality (small Riemannian gradient norm). Finally, we demonstrate the effectiveness of our approach on finite-horizon reference tracking tasks in data-driven control, highlighting its potential for practical generative and design applications.

4.1LGFeb 8, 2025
Poincaré Inequality for Local Log-Polyak-Lojasiewicz Measures: Non-asymptotic Analysis in Low-temperature Regime

Yun Gong, Zebang Shen, Niao He

Potential functions in highly pertinent applications, such as deep learning in the over-parameterized regime, are empirically observed to admit non-isolated minima. To understand the convergence behavior of stochastic dynamics in such landscapes, we propose to study the class of local log-PŁ measures $μ_ε\propto \exp(-V/ε)$, where the potential $V$ satisfies a local Polyak-Łojasiewicz (PŁ) inequality, and its set of local minima is provably connected. Notably, potentials in this class can exhibit local maxima and we characterize its optimal set S to be a compact $\mathcal{C}^2$ embedded submanifold of $\mathbb{R}^d$ without boundary. The non-contractibility of S distinguishes our function class from the classical convex setting topologically. Moreover, the embedding structure induces a naturally defined Laplace-Beltrami operator on S, and we show that its first non-trivial eigenvalue provides an $ε$-independent lower bound for the Poincaré constant in the Poincaré inequality of $μ_ε$. As a direct consequence, Langevin dynamics with such non-convex potential $V$ and diffusion coefficient $ε$ converges to its equilibrium $μ_ε$ at a rate of $\tilde{\mathcal{O}}(1/ε)$, provided $ε$ is sufficiently small. Here $\tilde{\mathcal{O}}$ hides logarithmic terms.

5.6OCJun 5, 2024
Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes

Yan Huang, Xiang Li, Yipeng Shen et al.

In this paper, we show that applying adaptive methods directly to distributed minimax problems can result in non-convergence due to inconsistency in locally computed adaptive stepsizes. To address this challenge, we propose D-AdaST, a Distributed Adaptive minimax method with Stepsize Tracking. The key strategy is to employ an adaptive stepsize tracking protocol involving the transmission of two extra (scalar) variables. This protocol ensures the consistency among stepsizes of nodes, eliminating the steady-state error due to the lack of coordination of stepsizes among nodes that commonly exists in vanilla distributed adaptive methods, and thus guarantees exact convergence. For nonconvex-strongly-concave distributed minimax problems, we characterize the specific transient times that ensure time-scale separation of stepsizes and quasi-independence of networks, leading to a near-optimal convergence rate of $\tilde{\mathcal{O}} \left( ε^{-\left( 4+δ\right)} \right)$ for any small $δ> 0$, matching that of the centralized counterpart. To the best of our knowledge, D-AdaST is the first distributed adaptive method achieving near-optimal convergence without knowing any problem-dependent parameters for nonconvex minimax problems. Extensive experiments are conducted to validate our theoretical results.

9.4OCMar 19, 2024
Primal Methods for Variational Inequality Problems with Functional Constraints

Liang Zhang, Niao He, Michael Muehlebach

Variational inequality problems are recognized for their broad applications across various fields including machine learning and operations research. First-order methods have emerged as the standard approach for solving these problems due to their simplicity and scalability. However, they typically rely on projection or linear minimization oracles to navigate the feasible set, which becomes computationally expensive in practical scenarios featuring multiple functional constraints. Existing efforts to tackle such functional constrained variational inequality problems have centered on primal-dual algorithms grounded in the Lagrangian function. These algorithms along with their theoretical analysis often require the existence and prior knowledge of the optimal Lagrange multipliers. In this work, we propose a simple primal method, termed Constrained Gradient Method (CGM), for addressing functional constrained variational inequality problems, without requiring any information on the optimal Lagrange multipliers. We establish a non-asymptotic convergence analysis of the algorithm for Minty variational inequality problems with monotone operators under smooth constraints. Remarkably, our algorithms match the complexity of projection-based methods in terms of operator queries for both monotone and strongly monotone settings, while using significantly cheaper oracles based on quadratic programming. Furthermore, we provide several numerical examples to evaluate the efficacy of our algorithms.

20.6OCMay 21, 2023
Two Sides of One Coin: the Limits of Untuned SGD and the Power of Adaptive Methods

Junchi Yang, Xiang Li, Ilyas Fatkhullin et al.

The classical analysis of Stochastic Gradient Descent (SGD) with polynomially decaying stepsize $η_t = η/\sqrt{t}$ relies on well-tuned $η$ depending on problem parameters such as Lipschitz smoothness constant, which is often unknown in practice. In this work, we prove that SGD with arbitrary $η> 0$, referred to as untuned SGD, still attains an order-optimal convergence rate $\widetilde{O}(T^{-1/4})$ in terms of gradient norm for minimizing smooth objectives. Unfortunately, it comes at the expense of a catastrophic exponential dependence on the smoothness constant, which we show is unavoidable for this scheme even in the noiseless setting. We then examine three families of adaptive methods – Normalized SGD (NSGD), AMSGrad, and AdaGrad – unveiling their power in preventing such exponential dependency in the absence of information about the smoothness parameter and boundedness of stochastic gradients. Our results provide theoretical justification for the advantage of adaptive methods over untuned SGD in alleviating the issue with large gradients.
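The blow-up described above can be seen in a one-dimensional toy problem. The sketch below runs noiseless gradient descent with stepsize $η/\sqrt{t}$ on $f(x) = Lx^2/2$ and records the largest iterate. The quadratic, the value of $η$, and the chosen values of $L$ are assumptions for illustration and are not the paper's lower-bound construction. Each step multiplies $x$ by $1 - ηL/\sqrt{t}$, so the iterate grows in magnitude while $t < (ηL/2)^2$ and contracts only afterwards.

```python
import math


def untuned_gd_peak(L, eta=0.5, x0=1.0, steps=200):
    """Run gradient descent with stepsize eta / sqrt(t) on f(x) = L * x^2 / 2.

    Returns (largest |x| seen during the run, final |x|).
    """
    x = x0
    peak = abs(x)
    for t in range(1, steps + 1):
        grad = L * x
        x -= (eta / math.sqrt(t)) * grad
        peak = max(peak, abs(x))
    return peak, abs(x)


# The same eta is used for every L, mimicking a stepsize chosen without knowing L.
peaks = {L: untuned_gd_peak(L)[0] for L in (1.0, 10.0, 20.0)}
```

With this fixed $η$, the peak stays at the starting value for $L = 1$ but grows by orders of magnitude as $L$ doubles from 10 to 20. A normalized update has a step length that does not scale with the gradient, so it cannot amplify the iterate in this way.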

13.0LGMay 18, 2023
On the Statistical Efficiency of Mean-Field Reinforcement Learning with General Function Approximation

Jiawei Huang, Batuhan Yardim, Niao He

In this paper, we study the fundamental statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Game (MFG) with general model-based function approximation. We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MF-MBED), which characterizes the inherent complexity of mean-field model classes. We show that a rich family of Mean-Field RL problems exhibits low MF-MBED. Additionally, we propose algorithms based on maximum likelihood estimation, which can return an $ε$-optimal policy for MFC or an $ε$-Nash Equilibrium policy for MFG. The overall sample complexity depends only polynomially on MF-MBED, which is potentially much lower than the size of state-action space. Compared with previous works, our results require only minimal assumptions, including realizability and Lipschitz continuity.

11.1LGFeb 20, 2022
Finite-Time Analysis of Natural Actor-Critic for POMDPs

Semih Cayci, Niao He, R. Srikant

We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.

12.1OCJan 19, 2022
Lifted Primal-Dual Method for Bilinearly Coupled Smooth Minimax Optimization

Kiran Koshy Thekumparampil, Niao He, Sewoong Oh

We study the bilinearly coupled minimax problem: $\min_{x} \max_{y} f(x) + y^\top A x - h(y)$, where $f$ and $h$ are both strongly convex smooth functions and admit first-order gradient oracles. Surprisingly, no known first-order algorithms have hitherto achieved the lower complexity bound of $Ω((\sqrt{\frac{L_x}{μ_x}} + \frac{\|A\|}{\sqrt{μ_x μ_y}} + \sqrt{\frac{L_y}{μ_y}}) \log(\frac{1}{\varepsilon}))$ for solving this problem up to an $\varepsilon$ primal-dual gap in the general parameter regime, where $L_x, L_y, μ_x, μ_y$ are the corresponding smoothness and strong convexity constants. We close this gap by devising the first optimal algorithm, the Lifted Primal-Dual (LPD) method. Our method lifts the objective into an extended form that allows both the smooth terms and the bilinear term to be handled optimally and seamlessly with the same primal-dual framework. Besides optimality, our method yields a desirably simple single-loop algorithm that uses only one gradient oracle call per iteration. Moreover, when $f$ is just convex, the same algorithm applied to a smoothed objective achieves the nearly optimal iteration complexity. We also provide a direct single-loop algorithm, using the LPD method, that achieves the iteration complexity of $O(\sqrt{\frac{L_x}{\varepsilon}} + \frac{\|A\|}{\sqrt{μ_y \varepsilon}} + \sqrt{\frac{L_y}{\varepsilon}})$. Numerical experiments on quadratic minimax problems and policy evaluation problems further demonstrate the fast convergence of our algorithm in practice.

21.0LGDec 10, 2021Code
Faster Single-loop Algorithms for Minimax Optimization without Strong Concavity

Junchi Yang, Antonio Orvieto, Aurelien Lucchi et al.

Gradient descent ascent (GDA), the simplest single-loop algorithm for nonconvex minimax optimization, is widely used in practical applications such as generative adversarial networks (GANs) and adversarial training. Despite its desirable simplicity, recent work shows inferior convergence rates of GDA in theory, even assuming strong concavity of the objective on one side. This paper establishes new convergence results for two alternative single-loop algorithms -- alternating GDA and smoothed GDA -- under the mild assumption that the objective satisfies the Polyak-Lojasiewicz (PL) condition in one variable. We prove that, to find an $ε$-stationary point, (i) alternating GDA and its stochastic variant (without mini-batch) respectively require $O(κ^{2} ε^{-2})$ and $O(κ^{4} ε^{-4})$ iterations, while (ii) smoothed GDA and its stochastic variant (without mini-batch) respectively require $O(κε^{-2})$ and $O(κ^{2} ε^{-4})$ iterations. The latter greatly improves over the vanilla GDA and gives the hitherto best known complexity results among single-loop algorithms under similar settings. We further showcase the empirical efficiency of these algorithms in training GANs and robust nonlinear regression.
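A minimal sketch of the alternating GDA update may help make the single-loop structure concrete. It uses a toy objective $f(x, y) = y \sin(x) - \frac{1}{2} y^2$, chosen here only for illustration: it is nonconvex in $x$ and strongly concave in $y$, which implies the PL condition in $y$. The objective, the stepsizes `tau_x` and `tau_y`, and the function names are assumptions, not taken from the paper or its code.

```python
import math


def grad_x(x, y):
    """Partial derivative of f(x, y) = y*sin(x) - y^2/2 with respect to x."""
    return math.cos(x) * y


def grad_y(x, y):
    """Partial derivative of f with respect to y."""
    return math.sin(x) - y


def alternating_gda(x, y, tau_x, tau_y, iters):
    """Descent step in x, then ascent step in y evaluated at the NEW x."""
    for _ in range(iters):
        x = x - tau_x * grad_x(x, y)
        y = y + tau_y * grad_y(x, y)  # uses the freshly updated x
    return x, y


if __name__ == "__main__":
    x, y = alternating_gda(x=1.0, y=0.0, tau_x=0.1, tau_y=0.5, iters=2000)
    grad_norm = math.hypot(grad_x(x, y), grad_y(x, y))
    print(f"x = {x:.3e}, y = {y:.3e}, gradient norm = {grad_norm:.3e}")
```

The only difference from simultaneous GDA is that the ascent step sees the updated `x`. The smaller stepsize on `x` follows the usual two-timescale heuristic and is not a tuned value.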

18.9LGJun 8, 2021
Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation

Semih Cayci, Niao He, R. Srikant

Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.
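To give a feel for what linear convergence means here, the sketch below runs entropy-regularized NPG in a much simpler setting than the paper's: a single state (a bandit), exact rewards, and a tabular softmax policy with no function approximation. Under these assumptions the update takes the known multiplicative form $π_{t+1}(a) \propto π_t(a)^{1-ητ} \exp(η\, r(a))$, whose fixed point is $π^*(a) \propto \exp(r(a)/τ)$. The reward vector, `eta`, `tau`, and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def entropy_npg_step(log_pi, rewards, eta, tau):
    """One tabular update: pi_new(a) ∝ pi(a)^(1 - eta*tau) * exp(eta * r(a))."""
    logits = [(1.0 - eta * tau) * lp + eta * r
              for lp, r in zip(log_pi, rewards)]
    return log_softmax(logits)


def log_ratio_span(log_pi, log_pi_star):
    """Spread of log(pi / pi_star) across actions; zero iff pi == pi_star."""
    d = [a - b for a, b in zip(log_pi, log_pi_star)]
    return max(d) - min(d)


if __name__ == "__main__":
    rewards, eta, tau = [1.0, 0.5, 0.0], 0.2, 0.5
    log_pi_star = log_softmax([r / tau for r in rewards])
    log_pi = log_softmax([0.0, 0.0, 0.0])  # start from the uniform policy
    for t in range(5):
        print(t, log_ratio_span(log_pi, log_pi_star))
        log_pi = entropy_npg_step(log_pi, rewards, eta, tau)
```

In this simplified setting the printed error shrinks by the constant factor $1 - ητ$ at every step, which is the geometric decay that "linear convergence" refers to. The paper's result concerns the harder case with linear function approximation, where convergence holds only up to an approximation error.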