LGAISYJun 12, 2021

Markov Decision Processes with Long-Term Average Constraints

arXiv:2106.06680v28 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of learning optimal policies under long-term average constraints in reinforcement learning, which is important for applications like resource management, but it is incremental as it builds on existing posterior sampling methods.

The paper tackles the problem of constrained Markov Decision Processes (CMDPs) with long-term average constraints, proposing CMDP-PSRL, a posterior sampling algorithm that achieves regret bounds of $ ilde{O}(poly(DSA)\sqrt{T})$ for reward and constraint violations, which is the first $ ilde{O}(\sqrt{T})$ result for ergodic MDPs with such constraints.

We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. Further, for MDP with $S$ states, $A$ actions, and diameter $D$, we prove that following CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from optimal policy by $\Tilde{O}(poly(DSA)\sqrt{T})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(poly(DSA)\sqrt{T})$. To the best of our knowledge, this is the first work which obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes