Adam Wierman

LG
h-index67
74papers
1,335citations
Novelty59%
AI Score61

74 Papers

67.7PFMay 30
Maximizing Compute Capacity in AI Data Centers through Cooling, Energy Storage, and Computing Adaptation

Shaolei Ren, Mohammad A. Islam, Adam Wierman

The deployment of artificial intelligence is increasingly constrained by limited site-level power capacity, which must support both compute systems and non-compute systems (primarily cooling) at all times. Cooling power demand, especially in non-evaporative cooling systems, can increase substantially with ambient temperature in the summer, producing recurring periods of elevated cooling power that often lasts for multiple hours per day. Therefore, maximizing compute capacity under a limited site-level power budget is an important planning and operational challenge. Sizing the compute system conservatively based on peak cooling power can leave part of the site-level power capacity underutilized when the cooling power is below its peak, particularly in cooler months. On the other hand, sizing the compute system aggressively based on low cooling power can cause the total site-level power demand to exceed the site-level power capacity during hot days in the summer. This paper proposes ComputeAmp (Compute Amplifier), a framework that maximizes the compute capacity by jointly and dynamically leveraging cooling, battery energy storage, and computing-based adaptation. We discuss the opportunities and limitations of ComputeAmp and illustrate its potential to significantly expand usable compute capacity within local power and water resource limits. We also present a problem formulation for ComputeAmp and highlight a few algorithmic and operational challenges.

LGMar 30, 2023
Contextual Combinatorial Bandits with Probabilistically Triggered Arms

Xutong Liu, Jinhang Zuo, Siwei Wang et al. · uw

We study contextual combinatorial bandits with probabilistically triggered arms (C$^2$MAB-T) under a variety of smoothness conditions that capture a wide range of applications, such as contextual cascading bandits and contextual influence maximization bandits. Under the triggering probability modulated (TPM) condition, we devise the C$^2$-UCB-T algorithm and propose a novel analysis that achieves an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$, where $d$ is the dimension of contexts, $p_{\min}$ is the minimum positive probability that any arm can be triggered, and batch-size $K$ is the maximum number of arms that can be triggered per round. Under the variance modulated (VM) or triggering probability and variance modulated (TPVM) conditions, we propose a new variance-adaptive algorithm VAC$^2$-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which is independent of the batch-size $K$. As a valuable by-product, our analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C$^2$MAB setting, improving existing results there as well. We also include experiments that demonstrate the improved performance of our algorithms compared with benchmark algorithms on synthetic and real-world datasets.

SYAug 17, 2018
Failure Localization in Power Systems via Tree Partitions

Linqi Guo, Chen Liang, Alessandro Zocca et al.

Cascading failures in power systems propagate non-locally, making the control and mitigation of outages extremely hard. In this work, we use the emerging concept of the tree partition of transmission networks to provide an analytical characterization of line failure localizability in transmission systems. Our results rigorously establish the well perceived intuition in power community that failures cannot cross bridges, and reveal a finer-grained concept that encodes more precise information on failure propagations within tree-partition regions. Specifically, when a non-bridge line is tripped, the impact of this failure only propagates within well-defined components, which we refer to as cells, of the tree partition defined by the bridges. In contrast, when a bridge line is tripped, the impact of this failure propagates globally across the network, affecting the power flow on all remaining transmission lines. This characterization suggests that it is possible to improve the system robustness by temporarily switching off certain transmission lines, so as to create more, smaller components in the tree partition; thus spatially localizing line failures and making the grid less vulnerable to large-scale outages. We illustrate this approach using the IEEE 118-bus test system and demonstrate that switching off a negligible portion of transmission lines allows the impact of line failures to be significantly more localized without substantial changes in line congestion.

GTMar 23, 2016
Routing and Staffing when Servers are Strategic

Ragavendran Gopalakrishnan, Sherwin Doroudi, Amy R. Ward et al.

Traditionally, research focusing on the design of routing and staffing policies for service systems has modeled servers as having fixed (possibly heterogeneous) service rates. However, service systems are generally staffed by people. Furthermore, people respond to workload incentives; that is, how hard a person works can depend both on how much work there is, and how the work is divided between the people responsible for it. In a service system, the routing and staffing policies control such workload incentives; and so the rate servers work will be impacted by the system's routing and staffing policies. This observation has consequences when modeling service system performance, and our objective is to investigate those consequences. We do this in the context of the M/M/N queue, which is the canonical model for large service systems. First, we present a model for "strategic" servers that choose their service rate in order to maximize a trade-off between an "effort cost", which captures the idea that servers exert more effort when working at a faster rate, and a "value of idleness", which assumes that servers value having idle time. Next, we characterize the symmetric Nash equilibrium service rate under any routing policy that routes based on the server idle time. We find that the system must operate in a quality-driven regime, in which servers have idle time, in order for an equilibrium to exist, which implies that the staffing must have a first-order term that strictly exceeds that of the common square-root staffing policy. Then, within the class of policies that admit an equilibrium, we (asymptotically) solve the problem of minimizing the total cost, when there are linear staffing costs and linear waiting costs. Finally, we end by exploring the question of whether routing policies that are based on the service rate, instead of the server idle time, can improve system performance.

AIJun 20, 2023
Towards Environmentally Equitable AI via Geographical Load Balancing

Pengfei Li, Jianyi Yang, Adam Wierman et al.

Fueled by the soaring popularity of large language and foundation models, the accelerated growth of artificial intelligence (AI) models' enormous environmental footprint has come under increased scrutiny. While many approaches have been proposed to make AI more energy-efficient and environmentally friendly, environmental inequity -- the fact that AI's environmental footprint can be disproportionately higher in certain regions than in others -- has emerged, raising social-ecological justice concerns. This paper takes a first step toward addressing AI's environmental inequity by balancing its regional negative environmental impact. Concretely, we focus on the carbon and water footprints of AI model inference and propose equity-aware geographical load balancing (GLB) to explicitly address AI's environmental impacts on the most disadvantaged regions. We run trace-based simulations by considering a set of 10 geographically-distributed data centers that serve inference requests for a large language AI model. The results demonstrate that existing GLB approaches may amplify environmental inequity while our proposed equity-aware GLB can significantly reduce the regional disparity in terms of carbon and water footprints.

LGJun 23, 2022
Chasing Convex Bodies and Functions with Black-Box Advice

Nicolas Christianson, Tinashe Handina, Adam Wierman

We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as $\textit{consistency}$, while also ensuring worst-case $\textit{robustness}$ even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, INTERP, achieves $(\sqrt{2}+ε)$-consistency and $\mathcal{O}(\frac{C}{ε^2})$-robustness for any $ε> 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, BDINTERP, achieves $(1+ε)$-consistency and $\mathcal{O}(\frac{CD}ε)$-robustness when the problem has bounded diameter $D$. Further, we show that BDINTERP achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $α$-polyhedral.

LGNov 30, 2022
Global Convergence of Localized Policy Iteration in Networked Multi-Agent Reinforcement Learning

Yizhou Zhang, Guannan Qu, Pan Xu et al.

We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network. The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards. To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) algorithm that provably learns a near-globally-optimal policy using only local information. In particular, we show that, despite restricting each agent's attention to only its $κ$-hop neighborhood, the agents are able to learn a policy with an optimality gap that decays polynomially in $κ$. In addition, we show the finite-sample convergence of LPI to the global optimal policy, which explicitly captures the trade-off between optimality and computational complexity in choosing $κ$. Numerical simulations demonstrate the effectiveness of LPI.

GTMar 3, 2023
A Finite-Sample Analysis of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar et al.

We study two-player zero-sum stochastic games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics, which integrates a discrete and doubly smoothed variant of the best-response dynamics into temporal-difference (TD)-learning and minimax value iteration. The resulting dynamics are payoff-based, convergent, rational, and symmetric among players. Our main results provide finite-sample guarantees. In particular, we prove the first-known $\tilde{\mathcal{O}}(1/ε^2)$ sample complexity bound for payoff-based independent learning dynamics, up to a smoothing bias. In the special case where the stochastic game has only one state (i.e., matrix games), we provide a sharper $\tilde{\mathcal{O}}(1/ε)$ sample complexity. Our analysis uses a novel coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.

OCApr 12, 2022
Near-Optimal Distributed Linear-Quadratic Regulator for Networked Systems

Sungho Shin, Yiheng Lin, Guannan Qu et al.

This paper studies the trade-off between the degree of decentralization and the performance of a distributed controller in a linear-quadratic control setting. We study a system of interconnected agents over a graph and a distributed controller, called $κ$-distributed control, which lets the agents make control decisions based on the state information within distance $κ$ on the underlying graph. This controller can tune its degree of decentralization using the parameter $κ$ and thus allows a characterization of the relationship between decentralization and performance. We show that under mild assumptions, including stabilizability, detectability, and a subexponentially growing graph condition, the performance difference between $κ$-distributed control and centralized optimal control becomes exponentially small in $κ$. This result reveals that distributed control can achieve near-optimal performance with a moderate degree of decentralization, and thus it is an effective controller architecture for large-scale networked systems.

LGJun 3, 2022
KCRL: Krasovskii-Constrained Reinforcement Learning with Guaranteed Stability in Nonlinear Dynamical Systems

Sahin Lale, Yuanyuan Shi, Guannan Qu et al.

Learning a dynamical system requires stabilizing the unknown dynamics to avoid state blow-ups. However, current reinforcement learning (RL) methods lack stabilization guarantees, which limits their applicability for the control of safety-critical systems. We propose a model-based RL framework with formal stability guarantees, Krasovskii Constrained RL (KCRL), that adopts Krasovskii's family of Lyapunov functions as a stability constraint. The proposed method learns the system dynamics up to a confidence interval using feature representation, e.g. Random Fourier Features. It then solves a constrained policy optimization problem with a stability constraint based on Krasovskii's method using a primal-dual approach to recover a stabilizing policy. We show that KCRL is guaranteed to learn a stabilizing policy in a finite number of interactions with the underlying unknown system. We also derive the sample complexity upper bound for stabilization of unknown nonlinear dynamical systems via the KCRL framework.

SYFeb 5, 2016
Optimizing Energy Storage Participation in Emerging Power Markets

Hao Chen, Zhenhua Liu, Ayse K. Coskun et al.

The growing amount of intermittent renewables in power generation creates challenges for real-time matching of supply and demand in the power grid. Emerging ancillary power markets provide new incentives to consumers (e.g., electrical vehicles, data centers, and others) to perform demand response to help stabilize the electricity grid. A promising class of potential demand response providers includes energy storage systems (ESSs). This paper evaluates the benefits of using various types of novel ESS technologies for a variety of emerging smart grid demand response programs, such as regulation services reserves (RSRs), contingency reserves, and peak shaving. We model, formulate and solve optimization problems to maximize the net profit of ESSs in providing each demand response. Our solution selects the optimal power and energy capacities of the ESS, determines the optimal reserve value to provide as well as the ESS real-time operational policy for program participation. Our results highlight that applying ultra-capacitors and flywheels in RSR has the potential to be up to 30 times more profitable than using common battery technologies such as LI and LA batteries for peak shaving.

AIFeb 3Code
Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity

Yingxuan Yang, Chengrui Qu, Muning Wen et al.

LLM-based multi-agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information-theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture-agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce $K^*$, an effective channel count that quantifies the number of effective channels without ground-truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity-aware design. Code and Dataset are available at the link: https://github.com/SafeRL-Lab/Agent-Scaling.

68.5CYMay 18
The Unpaid Toll: Estimating and Addressing the Public Health Impact of Data Centers

Yuelin Han, Zhifeng Wu, Pengfei Li et al.

The surging demand for artificial intelligence (AI) has led to a rapid expansion of energy-intensive data centers, contributing to criteria air pollutant emissions and raising public health concerns that have received comparatively limited attention in sustainability assessments. This paper introduces a principled methodology to model air pollutant emissions for data centers and estimate the public health impacts. Our findings reveal that the growing demand for AI and computing technologies is projected to push the total annual public health burden of U.S. data centers up to more than $20 billion in 2028. Although national-level impacts remain modest, data center health costs are unevenly distributed: in the most affected counties, the estimated per-household health burden can reach about seven times the national average. Next, we propose a health-informed computing framework that explicitly incorporates public health impacts into data center resource management across space and time, mitigating public health costs while supporting environmental sustainability. More broadly, we recommend extended energy reporting to include public health impact of data centers and paying attention to all impacted communities.

LGJul 20, 2023
Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with Q-Value Predictions

Tongxin Li, Yiheng Lin, Shaolei Ren et al.

We study the tradeoff between consistency and robustness in the context of a single-trajectory time-varying Markov Decision Process (MDP) with untrusted machine-learned advice. Our work departs from the typical approach of treating advice as coming from black-box sources by instead considering a setting where additional information about how the advice is generated is available. We prove a first-of-its-kind consistency and robustness tradeoff given Q-value advice under a general MDP model that includes both continuous and discrete state/action spaces. Our results highlight that utilizing Q-value advice enables dynamic pursuit of the better of machine-learned advice and a robust baseline, thus result in near-optimal performance guarantees, which provably improves what can be obtained solely with black-box advice.

LGMar 8, 2023
Convergence Rates for Localized Actor-Critic in Networked Markov Potential Games

Zhaoyi Zhou, Zaiwei Chen, Yiheng Lin et al.

We introduce a class of networked Markov potential games in which agents are associated with nodes in a network. Each agent has its own local potential function, and the reward of each agent depends only on the states and actions of the agents within a neighborhood. In this context, we propose a localized actor-critic algorithm. The algorithm is scalable since each agent uses only local information and does not need access to the global state. Further, the algorithm overcomes the curse of dimensionality through the use of function approximation. Our main results provide finite-sample guarantees up to a localization error and a function approximation error. Specifically, we achieve an $\tilde{\mathcal{O}}(\tildeε^{-4})$ sample complexity measured by the averaged Nash regret. This is the first finite-sample bound for multi-agent competitive games that does not depend on the number of agents.

LGJun 2, 2022
Equipping Black-Box Policies with Model-Based Advice for Stable Nonlinear Control

Tongxin Li, Ruixiao Yang, Guannan Qu et al.

Machine-learned black-box policies are ubiquitous for nonlinear control problems. Meanwhile, crude model information is often available for these problems from, e.g., linear approximations of nonlinear dynamics. We study the problem of equipping a black-box control policy with model-based advice for nonlinear control on a single trajectory. We first show a general negative result that a naive convex combination of a black-box policy and a linear model-based policy can lead to instability, even if the two policies are both stabilizing. We then propose an adaptive $λ$-confident policy, with a coefficient $λ$ indicating the confidence in a black-box policy, and prove its stability. With bounded nonlinearity, in addition, we show that the adaptive $λ$-confident policy achieves a bounded competitive ratio when a black-box policy is near-optimal. Finally, we propose an online learning approach to implement the adaptive $λ$-confident policy and verify its efficacy in case studies about the CartPole problem and a real-world electric vehicle (EV) charging problem with data bias due to COVID-19.

LGSep 2, 2024
Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar et al.

In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(ε^{-1})$ to find the Nash distribution and a sample complexity of $O(ε^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(ε^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.

DSOct 31, 2023
Online Conversion with Switching Costs: Robust and Learning-Augmented Algorithms

Adam Lechowicz, Nicolas Christianson, Bo Sun et al.

We introduce and study online conversion with switching costs, a family of online problems that capture emerging problems at the intersection of energy and sustainability. In this problem, an online player attempts to purchase (alternatively, sell) fractional shares of an asset during a fixed time horizon with length $T$. At each time step, a cost function (alternatively, price function) is revealed, and the player must irrevocably decide an amount of asset to convert. The player also incurs a switching cost whenever their decision changes in consecutive time steps, i.e., when they increase or decrease their purchasing amount. We introduce competitive (robust) threshold-based algorithms for both the minimization and maximization variants of this problem, and show they are optimal among deterministic online algorithms. We then propose learning-augmented algorithms that take advantage of untrusted black-box advice (such as predictions from a machine learning model) to achieve significantly better average-case performance without sacrificing worst-case competitive guarantees. Finally, we empirically evaluate our proposed algorithms using a carbon-aware EV charging case study, showing that our algorithms substantially improve on baseline methods for this problem.

LGOct 17, 2023
Online Algorithms with Uncertainty-Quantified Predictions

Bo Sun, Jerry Huang, Nicolas Christianson et al.

The burgeoning field of algorithms with predictions studies the problem of using possibly imperfect machine learning predictions to improve online algorithm performance. While nearly all existing algorithms in this framework make no assumptions on prediction quality, a number of methods providing uncertainty quantification (UQ) on machine learning models have been developed in recent years, which could enable additional information about prediction quality at decision time. In this work, we investigate the problem of optimally utilizing uncertainty-quantified predictions in the design of online algorithms. In particular, we study two classic online problems, ski rental and online search, where the decision-maker is provided predictions augmented with UQ describing the likelihood of the ground truth falling within a particular range of values. We demonstrate that non-trivial modifications to algorithm design are needed to fully leverage the UQ predictions. Moreover, we consider how to utilize more general forms of UQ, proposing an online learning framework that learns to exploit UQ to make decisions in multi-instance settings.

LGSep 30, 2024
End-to-End Conformal Calibration for Optimization Under Uncertainty

Christopher Yeh, Nicolas Christianson, Alan Wu et al.

Machine learning can significantly improve performance for decision-making under uncertainty across a wide range of domains. However, ensuring robustness guarantees requires well-calibrated uncertainty estimates, which can be difficult to achieve with neural networks. Moreover, in high-dimensional settings, there may be many valid uncertainty estimates, each with its own performance profile - i.e., not all uncertainty is equally valuable for downstream decision-making. To address this problem, this paper develops an end-to-end framework to learn uncertainty sets for conditional robust optimization in a way that is informed by the downstream decision-making loss, with robustness and calibration guarantees provided by conformal prediction. In addition, we propose to represent general convex uncertainty sets with partially input-convex neural networks, which are learned as part of our framework. Our approach consistently improves upon two-stage estimate-then-optimize baselines on concrete applications in energy storage arbitrage and portfolio optimization.

DSAug 14, 2024
Learning-Augmented Competitive Algorithms for Spatiotemporal Online Allocation with Deadline Constraints

Adam Lechowicz, Nicolas Christianson, Bo Sun et al.

We introduce and study spatiotemporal online allocation with deadline constraints ($\mathsf{SOAD}$), a new online problem motivated by emerging challenges in sustainability and energy. In $\mathsf{SOAD}$, an online player completes a workload by allocating and scheduling it on the points of a metric space $(X, d)$ while subject to a deadline $T$. At each time step, a service cost function is revealed that represents the cost of servicing the workload at each point, and the player must irrevocably decide the current allocation of work to points. Whenever the player moves this allocation, they incur a movement cost defined by the distance metric $d(\cdot, \ \cdot)$ that captures, e.g., an overhead cost. $\mathsf{SOAD}$ formalizes the open problem of combining general metrics and deadline constraints in the online algorithms literature, unifying problems such as metrical task systems and online search. We propose a competitive algorithm for $\mathsf{SOAD}$ along with a matching lower bound establishing its optimality. Our main algorithm, \textsc{ST-CLIP}, is a learning-augmented algorithm that takes advantage of predictions (e.g., forecasts of relevant costs) and achieves an optimal consistency-robustness trade-off. We evaluate our proposed algorithms in a simulated case study of carbon-aware spatiotemporal workload management, an application in sustainable computing that schedules a delay-tolerant batch compute job on a distributed network of data centers. In these experiments, we show that \textsc{ST-CLIP} substantially improves on heuristic baseline methods.

OCOct 31, 2023
Best of Both Worlds Guarantees for Smoothed Online Quadratic Optimization

Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman

We study the smoothed online quadratic optimization (SOQO) problem where, at each round $t$, a player plays an action $x_t$ in response to a quadratic hitting cost and an additional squared $\ell_2$-norm cost for switching actions. This problem class has strong connections to a wide range of application domains including smart grid management, adaptive control, and data center management, where switching-efficient algorithms are highly sought after. We study the SOQO problem in both adversarial and stochastic settings, and in this process, perform the first stochastic analysis of this class of problems. We provide the online optimal algorithm when the minimizers of the hitting cost function evolve as a general stochastic process, which, for the case of martingale process, takes the form of a distribution-agnostic dynamic interpolation algorithm (LAI). Next, we present the stochastic-adversarial trade-off by proving an $Ω(T)$ expected regret for the adversarial optimal algorithm in the literature (ROBD) with respect to LAI and, a sub-optimal competitive ratio for LAI in the adversarial setting. Finally, we present a best-of-both-worlds algorithm that obtains a robust adversarial performance while simultaneously achieving a near-optimal stochastic performance.

LGOct 31, 2023
Robust Learning for Smoothed Online Convex Optimization with Feedback Delay

Pengfei Li, Jianyi Yang, Adam Wierman et al.

We study a challenging form of Smoothed Online Convex Optimization, a.k.a. SOCO, including multi-step nonlinear switching costs and feedback delay. We propose a novel machine learning (ML) augmented online algorithm, Robustness-Constrained Learning (RCL), which combines untrusted ML predictions with a trusted expert online algorithm via constrained projection to robustify the ML prediction. Specifically,we prove that RCL is able to guarantee$(1+λ)$-competitiveness against any given expert for any$λ>0$, while also explicitly training the ML model in a robustification-aware manner to improve the average-case performance. Importantly,RCL is the first ML-augmented algorithm with a provable robustness guarantee in the case of multi-step switching cost and feedback delay.We demonstrate the improvement of RCL in both robustness and average performance using battery management for electrifying transportationas a case study.

OCSep 26, 2023
Learning the Uncertainty Sets for Control Dynamics via Set Membership: A Non-Asymptotic Analysis

Yingying Li, Jing Yu, Lauren Conger et al.

This paper studies uncertainty set estimation for unknown linear systems. Uncertainty sets are crucial for the quality of robust control since they directly influence the conservativeness of the control design. Departing from the confidence region analysis of least squares estimation, this paper focuses on set membership estimation (SME). Though good numerical performances have attracted applications of SME in the control literature, the non-asymptotic convergence rate of SME for linear systems remains an open question. This paper provides the first convergence rate bounds for SME and discusses variations of SME under relaxed assumptions. We also provide numerical results demonstrating SME's practical promise.

LGNov 2, 2023
Anytime-Competitive Reinforcement Learning with Policy Prior

Jianyi Yang, Pengfei Li, Tongxin Li et al.

This paper studies the problem of Anytime-Competitive Markov Decision Process (A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints. The regret analysis shows the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on the application of carbon-intelligent computing verify the reward performance and cost constraint guarantee of ACRL.

LGJun 16, 2023
Learning-Augmented Decentralized Online Convex Optimization in Networks

Pengfei Li, Jianyi Yang, Adam Wierman et al.

This paper studies decentralized online convex optimization in a networked multi-agent system and proposes a novel algorithm, Learning-Augmented Decentralized Online optimization (LADO), for individual agents to select actions only based on local online information. LADO leverages a baseline policy to safeguard online actions for worst-case robustness guarantees, while staying close to the machine learning (ML) policy for average performance improvement. In stark contrast with the existing learning-augmented online algorithms that focus on centralized settings, LADO achieves strong robustness guarantees in a decentralized setting. We also prove the average cost bound for LADO, revealing the tradeoff between average performance and worst-case robustness and demonstrating the advantage of training the ML policy by explicitly considering the robustness requirement.

LGSep 30, 2024
Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning

Laixi Shi, Jingchu Gai, Eric Mazumdar et al.

Standard multi-agent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. RMGs remains under-explored, from reasonable problem formulation to the development of sample-efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well-posedness of this class of RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.

LGJan 14
SCaLE: Switching Cost aware Learning and Exploration

Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman

This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm SCaLE that provably achieves a distribution-agnostic sub-linear dynamic regret, without the knowledge of hitting cost structure. En-route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments, against online-learning baselines, corroborate our claims, and highlight statistical consistency of our algorithm.

LGFeb 27, 2025Code
Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning

Shangding Gu, Laixi Shi, Muning Wen et al.

Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchmarks for robust RL. Current robust RL policies often focus on a specific type of uncertainty and are evaluated in distinct, one-off environments. In this work, we introduce Robust-Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components-agents' observed state and reward, agents' actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi-agent RL, it provides an open-source and user-friendly tool for the community to assess current methods and foster the development of robust RL algorithms. In addition, we benchmark existing standard and robust RL algorithms within this framework, uncovering significant deficiencies in each and offering new insights.

LGNov 3, 2023
Adversarial Attacks on Cooperative Multi-agent Bandits

Jinhang Zuo, Zhiyao Zhang, Xuchuang Wang et al.

Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.

71.4CYMar 18
Small Bottle, Big Pipe: Quantifying and Addressing the Impact of Data Centers on Public Water Systems

Yuelin Han, Pengfei Li, Adam Wierman et al.

Water is a critical resource for data centers and an efficient means of cooling. However, meeting the growing water demand of data centers requires substantial peak water withdrawals, which many communities in the United States cannot supply, especially during the hottest days of the year. This largely overlooked water capacity constraint is emerging as a bottleneck for data centers and can force operators to rely on less efficient dry cooling, further stressing the power grid during summer peaks. In this paper, we focus on the direct water withdrawal of U.S. data centers for cooling and examine their impacts on public water systems. Our analysis indicates that, if the 2024 water use intensity persists, U.S. data centers could collectively require 697-1,451 million gallons per day (MGD) of new water capacity through 2030, comparable to New York City's average daily supply of roughly 1,000 MGD. Under an optimistic scenario with a compound annual water use intensity reduction by 10%, the water capacity demand decreases to 227-604 MGD, although high-growth IT loads could still require enough capacity to hypothetically supply about half of New York City for most of the year. The total valuation of the new water capacity is on the order of \$10 billion, reaching up to \$58 billion in the high-growth case. These impacts are highly concentrated on communities hosting data centers. Finally, we provide recommendations to address the growing water capacity demand of U.S. data centers, including reporting peak water use, developing corporate-community partnerships, adopting a Water Capacity Neutral approach (colloquially "Pipe Neutral") to allow host communities to retain limited water capacity resources, and implementing coordinated water-power planning to responsibly leverage water for peak power reduction and opportunistically utilize surplus power to mitigate impacts on public water systems.

45.4SYMar 24
Data-Driven Successive Linearization for Optimal Voltage Control

Yiwei Dong, Wenqi Cui, Han Xu et al.

Power distribution systems are increasingly exposed to large voltage fluctuations driven by intermittent renewable generation and time varying loads (e.g., electric vehicles and storage). To address this challenge, a number of advanced controllers have been proposed for voltage regulation. However, these controllers typically rely on fixed linear approximations of voltage dynamics. As a result, the solutions may become infeasible when applied to the actual voltage behavior governed by nonlinear power flow equations, particularly under heavy power injection from distributed energy resources. This paper proposes a data-driven successive linearization approach for voltage control under nonlinear power flow constraints. By leveraging the fact that the deviation between the nonlinear power flow solution and its linearization is bounded by the distance from the operating point, we perform data-driven linearization around the most recent operating point. Convergence of the proposed method to a neighborhood of KKT points is established by exploiting the convexity of the objective function and structural properties of the nonlinear constraints. Case studies show that the proposed approach achieves fast convergence and adapts quickly to changes in net load.

AIFeb 11Code
Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization

Chengrui Qu, Christopher Yeh, Kishan Panaganti et al.

Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution, where value-factorization methods enforce the individual-global-maximum (IGM) principle so that decentralized greedy actions recover the team-optimal joint action. However, the reliability of this recipe in real-world settings remains unreliable due to environmental uncertainties arising from the sim-to-real gap, model mismatch, and system noise. We address this gap by introducing Distributionally robust IGM (DrIGM), a principle that requires each agent's robust greedy action to align with the robust team-optimal joint action. We show that DrIGM holds for a novel definition of robust individual action values, which is compatible with decentralized greedy execution and yields a provable robustness guarantee for the whole system. Building on this foundation, we derive DrIGM-compliant robust variants of existing value-factorization architectures (e.g., VDN/QMIX/QTRAN) that (i) train on robust Q-targets, (ii) preserve scalability, and (iii) integrate seamlessly with existing codebases without bespoke per-agent reward shaping. Empirically, on high-fidelity SustainGym simulators and a StarCraft game environment, our methods consistently improve out-of-distribution performance. Code and data are available at https://github.com/crqu/robust-coMARL.

LGFeb 18
Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning

Emile Anand, Richard Hoffmann, Sarah Liaw et al.

Coordinating large populations of interacting agents is a central challenge in multi-agent reinforcement learning (MARL), where the size of the joint state-action space scales exponentially with the number of agents. Mean-field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions. Recent graphon-based frameworks capture heterogeneity, but are computationally expensive as the number of agents grows. Therefore, we introduce $\texttt{GMFS}$, a $\textbf{G}$raphon $\textbf{M}$ean-$\textbf{F}$ield $\textbf{S}$ubsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling $κ$ agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with sample complexity $\mathrm{poly}(κ)$ and optimality gap $O(1/\sqrtκ)$. We verify our theory with numerical simulations in robotic coordination, showing that $\texttt{GMFS}$ achieves near-optimal performance.

LGApr 29, 2024
Sample-Efficient Robust Multi-Agent Reinforcement Learning in the Face of Environmental Uncertainty

Laixi Shi, Eric Mazumdar, Yuejie Chi et al.

To overcome the sim-to-real gap in reinforcement learning (RL), learned policies must maintain robustness against environmental uncertainties. While robust RL has been widely studied in single-agent regimes, in multi-agent environments, the problem remains understudied -- despite the fact that the problems posed by environmental uncertainties are often exacerbated by strategic interactions. This work focuses on learning in distributionally robust Markov games (RMGs), a robust variant of standard Markov games, wherein each agent aims to learn a policy that maximizes its own worst-case performance when the deployed environment deviates within its own prescribed uncertainty set. This results in a set of robust equilibrium strategies for all agents that align with classic notions of game-theoretic equilibria. Assuming a non-adaptive sampling mechanism from a generative model, we propose a sample-efficient model-based algorithm (DRNVI) with finite-sample complexity guarantees for learning robust variants of various notions of game-theoretic equilibria. We also establish an information-theoretic lower bound for solving RMGs, which confirms the near-optimal sample complexity of DRNVI with respect to problem-dependent factors such as the size of the state space, the target accuracy, and the horizon length.

LGNov 12, 2024
Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization

Chenbei Lu, Laixi Shi, Zaiwei Chen et al.

Reinforcement Learning (RL) algorithms are known to suffer from the curse of dimensionality, which refers to the fact that large-scale problems often lead to exponentially high sample complexity. A common solution is to use deep neural networks for function approximation; however, such approaches typically lack theoretical guarantees. To provably address the curse of dimensionality, we observe that many real-world problems exhibit task-specific model structures that, when properly leveraged, can improve the sample efficiency of RL. Building on this insight, we propose overcoming the curse of dimensionality by approximately factorizing the original Markov decision processes (MDPs) into smaller, independently evolving MDPs. This factorization enables the development of sample-efficient RL algorithms in both model-based and model-free settings, with the latter involving a variant of variance-reduced Q-learning. We provide improved sample complexity guarantees for both proposed algorithms. Notably, by leveraging model structure through the approximate factorization of the MDP, the dependence of sample complexity on the size of the state-action space can be exponentially reduced. Numerically, we demonstrate the practicality of our proposed methods through experiments on both synthetic MDP tasks and a wind farm-equipped storage control problem.

DCMay 22, 2024
Carbon Connect: An Ecosystem for Sustainable Computing

Benjamin C. Lee, David Brooks, Arthur van Benthem et al.

Computing is at a moment of profound opportunity. Emerging applications -- such as capable artificial intelligence, immersive virtual realities, and pervasive sensor systems -- drive unprecedented demand for computer. Despite recent advances toward net zero carbon emissions, the computing industry's gross energy usage continues to rise at an alarming rate, outpacing the growth of new energy installations and renewable energy deployments. A shift towards sustainability is needed to spark a transformation in how computer systems are manufactured, allocated, and consumed. Carbon Connect envisions coordinated research thrusts that produce design and management strategies for sustainable, next-generation computer systems. These strategies must flatten and then reverse growth trajectories for computing power and carbon for society's most rapidly growing applications such as artificial intelligence and virtual spaces. We will require accurate models for carbon accounting in computing technology. For embodied carbon, we must re-think conventional design strategies -- over-provisioned monolithic servers, frequent hardware refresh cycles, custom silicon -- and adopt life-cycle design strategies that more effectively reduce, reuse and recycle hardware at scale. For operational carbon, we must not only embrace renewable energy but also design systems to use that energy more efficiently. Finally, new hardware design and management strategies must be cognizant of economic policy and regulatory landscape, aligning private initiatives with societal goals. Many of these broader goals will require computer scientists to develop deep, enduring collaborations with researchers in economics, law, and industrial ecology to spark change in broader practice.

LGNov 6, 2024
Hybrid Transfer Reinforcement Learning: Provable Sample Efficiency from Shifted-Dynamics Data

Chengrui Qu, Laixi Shi, Kishan Panaganti et al.

Online Reinforcement learning (RL) typically requires high-stakes online interaction data to learn a policy for a target task. This prompts interest in leveraging historical data to improve sample efficiency. The historical data may come from outdated or related source environments with different dynamics. It remains unclear how to effectively use such data in the target task to provably enhance learning and sample efficiency. To address this, we propose a hybrid transfer RL (HTRL) setting, where an agent learns in a target environment while accessing offline data from a source environment with shifted dynamics. We show that -- without information on the dynamics shift -- general shifted-dynamics data, even with subtle shifts, does not reduce sample complexity in the target environment. However, with prior information on the degree of the dynamics shift, we design HySRL, a transfer algorithm that achieves problem-dependent sample complexity and outperforms pure online RL. Finally, our experimental results demonstrate that HySRL surpasses state-of-the-art online RL baseline.

DSFeb 21, 2024
Chasing Convex Functions with Long-term Constraints

Adam Lechowicz, Nicolas Christianson, Bo Sun et al.

We introduce and study a family of online metric problems with long-term constraints. In these problems, an online player makes decisions $\mathbf{x}_t$ in a metric space $(X,d)$ to simultaneously minimize their hitting cost $f_t(\mathbf{x}_t)$ and switching cost as determined by the metric. Over the time horizon $T$, the player must satisfy a long-term demand constraint $\sum_{t} c(\mathbf{x}_t) \geq 1$, where $c(\mathbf{x}_t)$ denotes the fraction of demand satisfied at time $t$. Such problems can find a wide array of applications to online resource allocation in sustainable energy/computing systems. We devise optimal competitive and learning-augmented algorithms for the case of bounded hitting cost gradients and weighted $\ell_1$ metrics, and further show that our proposed algorithms perform well in numerical experiments.

LGDec 8, 2023
Two-Timescale Q-Learning with Function Approximation in Zero-Sum Stochastic Games

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar et al.

We consider two-player zero-sum stochastic games and propose a two-timescale $Q$-learning algorithm with function approximation that is payoff-based, convergent, rational, and symmetric between the two players. In two-timescale $Q$-learning, the fast-timescale iterates are updated in spirit to the stochastic gradient descent and the slow-timescale iterates (which we use to compute the policies) are updated by taking a convex combination between its previous iterate and the latest fast-timescale iterate. Introducing the slow timescale as well as its update equation marks as our main algorithmic novelty. In the special case of linear function approximation, we establish, to the best of our knowledge, the first last-iterate finite-sample bound for payoff-based independent learning dynamics of these types. The result implies a polynomial sample complexity to find a Nash equilibrium in such stochastic games. To establish the results, we model our proposed algorithm as a two-timescale stochastic approximation and derive the finite-sample bound through a Lyapunov-based approach. The key novelty lies in constructing a valid Lyapunov function to capture the evolution of the slow-timescale iterates. Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, where the regularized Nash gap serves as a valid Lyapunov function. This insight enables us to construct a valid Lyapunov function via a generalized variant of the Moreau envelope of the regularized Nash gap. The construction of our Lyapunov function might be of broad independent interest in studying the behavior of stochastic approximation algorithms.

LGMay 25, 2025
Efficient Policy Optimization in Robust Constrained MDPs with Iteration Complexity Guarantees

Sourav Ganguly, Arnob Ghosh, Kishan Panaganti et al.

Constrained decision-making is essential for designing safe policies in real-world control systems, yet simulated environments often fail to capture real-world adversities. We consider the problem of learning a policy that will maximize the cumulative reward while satisfying a constraint, even when there is a mismatch between the real model and an accessible simulator/nominal model. In particular, we consider the robust constrained Markov decision problem (RCMDP) where an agent needs to maximize the reward and satisfy the constraint against the worst possible stochastic model under the uncertainty set centered around an unknown nominal model. Primal-dual methods, effective for standard constrained MDP (CMDP), are not applicable here because of the lack of the strong duality property. Further, one cannot apply the standard robust value-iteration based approach on the composite value function either as the worst case models may be different for the reward value function and the constraint value function. We propose a novel technique that effectively minimizes the constraint value function--to satisfy the constraints; on the other hand, when all the constraints are satisfied, it can simply maximize the robust reward value function. We prove that such an algorithm finds a policy with at most $ε$ sub-optimality and feasible policy after $O(ε^{-2})$ iterations. In contrast to the state-of-the-art method, we do not need to employ a binary search, thus, we reduce the computation time by at least 4x for smaller value of discount factor ($γ$) and by at least 6x for larger value of $γ$.

GTNov 6, 2024
Online Budgeted Matching with General Bids

Jianyi Yang, Pengfei Li, Adam Wierman et al.

Online Budgeted Matching (OBM) is a classic problem with important applications in online advertising, online service matching, revenue management, and beyond. Traditional online algorithms typically assume a small bid setting, where the maximum bid-to-budget ratio (κ) is infinitesimally small. While recent algorithms have tried to address scenarios with non-small or general bids, they often rely on the Fractional Last Matching (FLM) assumption, which allows for accepting partial bids when the remaining budget is insufficient. This assumption, however, does not hold for many applications with indivisible bids. In this paper, we remove the FLM assumption and tackle the open problem of OBM with general bids. We first establish an upper bound of 1-κon the competitive ratio for any deterministic online algorithm. We then propose a novel meta algorithm, called MetaAd, which reduces to different algorithms with first known provable competitive ratios parameterized by the maximum bid-to-budget ratio κ\in [0, 1]. As a by-product, we extend MetaAd to the FLM setting and get provable competitive algorithms. Finally, we apply our competitive analysis to the design learning-augmented algorithms.

LGMay 8, 2024
Model-Free Robust $φ$-Divergence Reinforcement Learning Using Both Offline and Online Data

Kishan Panaganti, Adam Wierman, Eric Mazumdar

The robust $φ$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust $φ$-regularized fitted Q-iteration (RPQ) for learning an $ε$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (with robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $φ$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $φ$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Towards this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $φ$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrary large state space.

LGOct 9, 2025
Conformal Risk Training: End-to-End Optimization of Conformal Risk Control

Christopher Yeh, Nicolas Christianson, Adam Wierman et al.

While deep learning models often achieve high predictive accuracy, their predictions typically do not come with any provable guarantees on risk or reliability, which are critical for deployment in high-stakes applications. The framework of conformal risk control (CRC) provides a distribution-free, finite-sample method for controlling the expected value of any bounded monotone loss function and can be conveniently applied post-hoc to any pre-trained deep learning model. However, many real-world applications are sensitive to tail risks, as opposed to just expected loss. In this work, we develop a method for controlling the general class of Optimized Certainty-Equivalent (OCE) risks, a broad class of risk measures which includes as special cases the expected loss (generalizing the original CRC method) and common tail risks like the conditional value-at-risk (CVaR). Furthermore, standard post-hoc CRC can degrade average-case performance due to its lack of feedback to the model. To address this, we introduce "conformal risk training," an end-to-end approach that differentiates through conformal OCE risk control during model training or fine-tuning. Our method achieves provable risk guarantees while demonstrating significantly improved average-case performance over post-hoc approaches on applications to controlling classifiers' false negative rate and controlling financial risk in battery storage operation.

LGMay 23, 2025
KL-regularization Itself is Differentially Private in Bandits and RLHF

Yizhou Zhang, Kishan Panaganti, Laixi Shi et al.

Differential Privacy (DP) provides a rigorous framework for privacy, ensuring the outputs of data-driven algorithms remain statistically indistinguishable across datasets that differ in a single entry. While guaranteeing DP generally requires explicitly injecting noise either to the algorithm itself or to its outputs, the intrinsic randomness of existing algorithms presents an opportunity to achieve DP ``for free''. In this work, we explore the role of regularization in achieving DP across three different decision-making problems: multi-armed bandits, linear contextual bandits, and reinforcement learning from human feedback (RLHF), in offline data settings. We show that adding KL-regularization to the learning objective (a common approach in optimization algorithms) makes the action sampled from the resulting stochastic policy itself differentially private. This offers a new route to privacy guarantees without additional noise injection, while also preserving the inherent advantage of regularization in enhancing performance.

LGDec 11, 2025
Fairness-Regularized Online Optimization with Switching Costs

Pengfei Li, Yuelin Han, Adam Wierman et al.

Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length $T$ increases. Then, we propose FairOBD (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, FairOBD decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that FairOBD offers a worst-case asymptotic competitive ratio against a novel benchmark -- the optimal offline algorithm with parameterized constraints -- by considering $T\to\infty$. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate FairOBD, showing that FairOBD can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions.

DSNov 23, 2025
Online Smoothed Demand Management

Adam Lechowicz, Nicolas Christianson, Mohammad Hajiesmaili et al.

We introduce and study a class of online problems called online smoothed demand management $(\texttt{OSDM})$, motivated by paradigm shifts in grid integration and energy storage for large energy consumers such as data centers. In $\texttt{OSDM}$, an operator makes two decisions at each time step: an amount of energy to be purchased, and an amount of energy to be delivered (i.e., used for computation). The difference between these decisions charges (or discharges) the operator's energy storage (e.g., a battery). Two types of demand arrive online: base demand, which must be covered at the current time, and flexible demand, which can be satisfied at any time before a demand-specific deadline $Δ_t$. The operator's goal is to minimize a cost (subject to above constraints) that combines a cost of purchasing energy, a cost for delivering energy (if applicable), and smoothness penalties on the purchasing and delivery rates to discourage fluctuations and encourage ``grid healthy'' decisions. $\texttt{OSDM}$ generalizes several problems in the online algorithms literature while being the first to fully model applications of interest. We propose a competitive algorithm for $\texttt{OSDM}$ called $\texttt{PAAD}$ (partitioned accounting & aggregated decisions) and show it achieves the optimal competitive ratio. To overcome the pessimism typical of worst-case analysis, we also propose a novel learning framework that provides guarantees on the worst-case competitive ratio (i.e., to provide robustness against nonstationarity) while allowing end-to-end differentiable learning of the best algorithm on historical instances of the problem. We evaluate our algorithms in a case study of a grid-integrated data center with battery storage, showing that $\texttt{PAAD}$ effectively solves the problem and end-to-end learning achieves substantial performance improvements compared to $\texttt{PAAD}$.

LGOct 21, 2025
Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

Chenbei Lu, Zaiwei Chen, Tongxin Li et al.

Traditional reinforcement learning (RL) assumes the agents make decisions based on Markov decision processes (MDPs) with one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which provide additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding these predictions into an MDP leads to an exponential blow-up in state space and the curse of dimensionality. Moreover, existing RL theory provides few tools to analyze prediction-augmented MDPs, as it typically works on one-step transition kernels and cannot accommodate multi-step predictions with errors or partial action-coverage. We address these challenges with three key innovations: First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis on the Bayesian value function, which enables characterizing the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.

ROSep 23, 2025
SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

Yarden As, Chengrui Qu, Benjamin Unger et al.

Deploying reinforcement learning (RL) safely in the real world is challenging, as policies trained in simulators must face the inevitable sim-to-real gap. Robust safe RL techniques are provably safe, however difficult to scale, while domain randomization is more practical yet prone to unsafe behaviors. We address this gap by proposing SPiDR, short for Sim-to-real via Pessimistic Domain Randomization -- a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.

LGJul 21, 2025
Competitive Algorithms for Multi-Agent Ski-Rental Problems

Xuchuang Wang, Bo Sun, Hedyeh Beyhaghi et al.

This paper introduces a novel multi-agent ski-rental problem that generalizes the classical ski-rental dilemma to a group setting where agents incur individual and shared costs. In our model, each agent can either rent at a fixed daily cost, or purchase a pass at an individual cost, with an additional third option of a discounted group pass available to all. We consider scenarios in which agents' active days differ, leading to dynamic states as agents drop out of the decision process. To address this problem from different perspectives, we define three distinct competitive ratios: overall, state-dependent, and individual rational. For each objective, we design and analyze optimal deterministic and randomized policies. Our deterministic policies employ state-aware threshold functions that adapt to the dynamic states, while our randomized policies sample and resample thresholds from tailored state-aware distributions. The analysis reveals that symmetric policies, in which all agents use the same threshold, outperform asymmetric ones. Our results provide competitive ratio upper and lower bounds and extend classical ski-rental insights to multi-agent settings, highlighting both theoretical and practical implications for group decision-making under uncertainty.