Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm
This work addresses the problem of average reward Constrained Markov Decision Processes for researchers and practitioners in the field of reinforcement learning, providing an incremental improvement over existing methods.
The authors tackled the problem of average reward Constrained Markov Decision Processes, achieving global convergence and constraint violation rates of $ ilde{mathcal{O}}(1/sqrt{T})$ when the mixing time is known, and $ ilde{mathcal{O}}(1/T^{0.5-ε})$ otherwise. Their algorithm achieves a high convergence rate, with rates that match the theoretical lower bound.
This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $τ_{\mathrm{mix}}$, is known to the learner. In absence of knowledge of $τ_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-ε})$ provided that $T \geq \tilde{\mathcal{O}}\left(τ_{\mathrm{mix}}^{2/ε}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.