LGApr 27, 2023

A Best-of-Both-Worlds Algorithm for Constrained MDPs with Long-Term Constraints

Jacopo Germano, Francesco Emanuele Stradi, Gianmarco Genalti, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

arXiv:2304.14326v213.712 citationsh-index: 31

Originality Highly original

AI Analysis

This addresses the problem of safe and efficient learning in real-world applications like autonomous driving and recommender systems, where agents must obey constraints during learning, representing a foundational advance in constrained reinforcement learning.

The paper tackles online learning in constrained Markov decision processes (CMDPs) with long-term constraints, where rewards and constraints can be stochastic or adversarial, by proposing a best-of-both-worlds algorithm that matches state-of-the-art regret and constraint violation bounds for stochastic settings and provides the first guarantees for adversarial constraints.

We study online learning in episodic constrained Markov decision processes (CMDPs), where the learner aims at collecting as much reward as possible over the episodes, while satisfying some long-term constraints during the learning process. Rewards and constraints can be selected either stochastically or adversarially, and the transition function is not known to the learner. While online learning in classical (unconstrained) MDPs has received considerable attention over the last years, the setting of CMDPs is still largely unexplored. This is surprising, since in real-world applications, such as, e.g., autonomous driving, automated bidding, and recommender systems, there are usually additional constraints and specifications that an agent has to obey during the learning process. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with long-term constraints, in the flavor of Balseiro et al. (2023). Our algorithm is capable of handling settings in which rewards and constraints are selected either stochastically or adversarially, without requiring any knowledge of the underling process. Moreover, our algorithm matches state-of-the-art regret and constraint violation bounds for settings in which constraints are selected stochastically, while it is the first to provide guarantees in the case in which they are chosen adversarially.

View on arXiv PDF

Similar