LG NI SY OC MLJun 10, 2020

Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

Qinbo Bai, Vaneet Aggarwal, Ather Gattami

arXiv:2006.05961v26.56 citations

Originality Incremental advance

AI Analysis

This addresses the challenge of optimizing dynamical systems with constraints for applications like robotics or resource management, but it is incremental as it builds on existing Q-learning and constrained optimization methods.

The paper tackles the problem of constrained Markov Decision Processes (CMDPs) with long-term constraints in a model-free setting where transition probabilities are unknown, proposing an algorithm that achieves an O(T^{1/2+γ}) regret bound for reward and O(T^{1-γ/2}) for constraint violation, which are the first such results for this scenario.

In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward as well as satisfy the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $γ\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve $O(T^{1/2+γ})$ regret bound for the obtained reward and $O(T^{1-γ/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known apriori.

View on arXiv PDF

Similar