LGMay 23, 2024

Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints

Francesco Emanuele Stradi, Anna Lunghi, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

arXiv:2405.14372v27.95 citationsh-index: 17

Originality Incremental advance

AI Analysis

This work addresses the challenge of constrained reinforcement learning in non-stationary environments for researchers and practitioners, offering a theoretical framework to ease prior impossibility results, though it is incremental in building on existing CMDP literature.

The paper tackles the problem of learning in constrained Markov decision processes (CMDPs) with non-stationary rewards and constraints, where prior impossibility results prevent sublinear regret and constraint violation in adversarial settings. It proposes algorithms that achieve $ ilde{\mathcal{O}} (\sqrt{T} + C)$ regret and positive constraint violation under bandit feedback, with performance degrading smoothly as non-stationarity measured by corruption value $C$ increases, and extends this to unknown $C$ via a meta-procedure applicable to broader non-stationary constrained online learning.

In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation, when competing against a best-in-hindsight policy that satisfies constraints on average. In this paper, we show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases. Specifically, we propose algorithms attaining $\tilde{\mathcal{O}} (\sqrt{T} + C)$ regret and positive constraint violation under bandit feedback, where $C$ is a corruption value measuring the environment non-stationarity. This can be $Θ(T)$ in the worst case, coherently with the impossibility result for adversarial CMDPs. First, we design an algorithm with the desired guarantees when $C$ is known. Then, in the case $C$ is unknown, we show how to obtain the same results by embedding such an algorithm in a general meta-procedure. This is of independent interest, as it can be applied to any non-stationary constrained online learning setting.

View on arXiv PDF

Similar