Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
This addresses the challenge of optimizing dynamical systems with constraints for applications like robotics or resource management, but it is incremental as it builds on existing Q-learning and constrained optimization methods.
The paper tackles the problem of constrained Markov Decision Processes (CMDPs) with long-term constraints in a model-free setting where transition probabilities are unknown, proposing an algorithm that achieves an O(T^{1/2+γ}) regret bound for reward and O(T^{1-γ/2}) for constraint violation, which are the first such results for this scenario.
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward as well as satisfy the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $γ\in(0,\frac{1}{2})$, the proposed algorithm is shown to achieve $O(T^{1/2+γ})$ regret bound for the obtained reward and $O(T^{1-γ/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDP with long-term constraints, where the transition probabilities are not known apriori.