LGJun 1, 2022
Byzantine-Robust Online and Offline Distributed Reinforcement LearningYiding Chen, Xuezhou Zhang, Kaiqing Zhang et al.
We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, $α$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any sizes. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve superior robustness guarantee than prior works.
GTNov 1, 2023Code
Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and ValueYoung Wu, Jeremy McMahan, Yiding Chen et al.
We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost. We characterize the set of policy profiles that can be installed as the unique equilibrium of a game and establish sufficient and necessary conditions for successful installation. We propose an efficient algorithm that solves a convex optimization problem with linear constraints and then performs random perturbation to obtain a modification plan with a near-optimal cost. The code for our algorithm is available at https://github.com/YoungWu559/game-modification .
LGJul 20, 2022
The Game of Hidden Rules: A New Kind of Benchmark Challenge for Machine LearningEric Pulick, Shubham Bharti, Yiding Chen et al.
As machine learning (ML) is more tightly woven into society, it is imperative that we better characterize ML's strengths and limitations if we are to employ it responsibly. Existing benchmark environments for ML, such as board and video games, offer well-defined benchmarks for progress, but constituent tasks are often complex, and it is frequently unclear how task characteristics contribute to overall difficulty for the machine learner. Likewise, without a systematic assessment of how task characteristics influence difficulty, it is challenging to draw meaningful connections between performance in different benchmark environments. We introduce a novel benchmark environment that offers an enormous range of ML challenges and enables precise examination of how task elements influence practical difficulty. The tool frames learning tasks as a "board-clearing game," which we call the Game of Hidden Rules (GOHR). The environment comprises an expressive rule language and a captive server environment that can be installed locally. We propose a set of benchmark rule-learning tasks and plan to support a performance leader-board for researchers interested in attempting to learn our rules. GOHR complements existing environments by allowing fine, controlled modifications to tasks, enabling experimenters to better understand how each facet of a given learning task contributes to its practical difficulty for an arbitrary ML algorithm.
GTJul 20, 2024
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful ContributionAlex Clinton, Yiding Chen, Xiaojin Zhu et al.
We study a collaborative learning problem where $m$ agents aim to estimate a vector $μ=(μ_1,\ldots,μ_d)\in \mathbb{R}^d$ by sampling from associated univariate normal distributions $\{\mathcal{N}(μ_k, σ^2)\}_{k\in[d]}$. Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(μ_k, σ^2)$. Instead of working independently, agents can exchange data, collecting cheaper samples and sharing them in return for costly data, thereby reducing both costs and estimation error. We design a mechanism to facilitate such collaboration, while addressing two key challenges: ensuring individually rational (IR) and fair outcomes so all agents benefit, and preventing strategic behavior (e.g. non-collection, data fabrication) to avoid socially undesirable outcomes. We design a mechanism and an associated Nash equilibrium (NE) which minimizes the social penalty-sum of agents' estimation errors and collection costs-while being IR for all agents. We achieve a $\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the worst case and an $\mathcal{O}(1)$-approximation under favorable conditions. Additionally, we establish three hardness results: no nontrivial mechanism guarantees (i) a dominant strategy equilibrium where agents report truthfully, (ii) is IR for every strategy profile of other agents, (iii) or avoids a worst-case $Ω(\sqrt{m})$ price of stability in any NE. Finally, by integrating concepts from axiomatic bargaining, we demonstrate that our mechanism supports fairer outcomes than one which minimizes social penalty.
LGFeb 2, 2025Code
Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based ExplorationMingyu Chen, Yiding Chen, Wen Sun et al.
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focus on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024).. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
LGMay 27, 2025Code
Efficient Controllable Diffusion via Optimal Classifier GuidanceOwen Oertell, Shikun Sun, Yiding Chen et al.
The controllable generation of diffusion models aims to steer the model to generate samples that optimize some given objective functions. It is desirable for a variety of applications including image generation, molecule generation, and DNA/sequence generation. Reinforcement Learning (RL) based fine-tuning of the base model is a popular approach but it can overfit the reward function while requiring significant resources. We frame controllable generation as a problem of finding a distribution that optimizes a KL-regularized objective function. We present SLCD -- Supervised Learning based Controllable Diffusion, which iteratively generates online data and trains a small classifier to guide the generation of the diffusion model. Similar to the standard classifier-guided diffusion, SLCD's key computation primitive is classification and does not involve any complex concepts from RL or control. Via a reduction to no-regret online learning analysis, we show that under KL divergence, the output from SLCD provably converges to the optimal solution of the KL-regularized objective. Further, we empirically demonstrate that SLCD can generate high quality samples with nearly the same inference time as the base model in both image generation with continuous diffusion and biological sequence generation with discrete diffusion. Our code is available at https://github.com/Owen-Oertell/slcd
LGOct 17, 2024
Diffusing States and Matching Scores: A New Framework for Imitation LearningRunzhe Wu, Yiding Chen, Gokul Swamy et al.
Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing score-matching along diffused states to measure the discrepancy between the expert's and learner's states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms both GAN-style imitation learning baselines and discriminator-free imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, crawl, and navigate through obstacles.
LGMay 28, 2025
Scaling Offline RL via Efficient and Expressive Shortcut ModelsNicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen et al.
Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
LGMay 6, 2025
Convergence Of Consistency Model With Multistep Sampling Under General Data AssumptionsYiding Chen, Yiyi Zhang, Owen Oertell et al.
Diffusion models accomplish remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions to map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumption and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.
LGJun 8, 2025
A Cramér-von Mises Approach to Incentivizing Truthful Data SharingAlex Clinton, Thomas Zeng, Yiding Chen et al.
Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g. Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
LGJun 11, 2021
Corruption-Robust Offline Reinforcement LearningXuezhou Zhang, Yiding Chen, Jerry Zhu et al.
We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify $ε$ fraction of the tuples. From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $Ω(dε)$ optimality gap is unavoidable in linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which achieve near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $ε$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $ε$ is necessary, as we show that being adaptive to unknown $ε$ is impossible.This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
LGFeb 11, 2021
Robust Policy Gradient against Strong Data CorruptionXuezhou Zhang, Yiding Chen, Xiaojin Zhu et al.
We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most $ε$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(ε)$-optimal policy under our attack model. Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrtε)$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(ε^{1/4})$-optimal policy. We emphasize that FPG is the first that can achieve a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complimentary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.
OCOct 18, 2019
Error Lower Bounds of Constant Step-size Stochastic Gradient DescentZhiyan Ding, Yiding Chen, Qin Li et al.
Stochastic Gradient Descent (SGD) plays a central role in modern machine learning. While there is extensive work on providing error upper bound for SGD, not much is known about SGD error lower bound. In this paper, we study the convergence of constant step-size SGD. We provide error lower bound of SGD for potentially non-convex objective functions with Lipschitz gradients. To our knowledge, this is the first analysis for SGD error lower bound without the strong convexity assumption. We use experiments to illustrate our theoretical results.
LGFeb 1, 2019
Optimal Attack against Autoregressive Models by Manipulating the EnvironmentYiding Chen, Xiaojin Zhu
We describe an optimal adversarial attack formulation against autoregressive time series forecast using Linear Quadratic Regulator (LQR). In this threat model, the environment evolves according to a dynamical system; an autoregressive model observes the current environment state and predicts its future values; an attacker has the ability to modify the environment state in order to manipulate future autoregressive forecasts. The attacker's goal is to force autoregressive forecasts into tracking a target trajectory while minimizing its attack expenditure. In the white-box setting where the attacker knows the environment and forecast models, we present the optimal attack using LQR for linear models, and Model Predictive Control (MPC) for nonlinear models. In the black-box setting, we combine system identification and MPC. Experiments demonstrate the effectiveness of our attacks.