Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model
This work provides sample complexity bounds for CMDPs, which is significant for researchers and practitioners developing reinforcement learning algorithms that must operate under safety or resource constraints.
This paper addresses infinite-horizon linear constrained Markov decision processes (CMDPs) with a generative model, aiming to maximize expected cumulative reward under expected cumulative constraints. The authors propose a primal-dual framework that, when instantiated with mirror descent value iteration, achieves a sample complexity of Õ(d^2/((1-γ)^4ε^2)) for relaxed feasibility and Õ(d^2/((1-γ)^6ε^2ζ^2)) for strict feasibility, with near-optimal dependence on key parameters.
We consider infinite-horizon $γ$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (\texttt{MDVI})~\citep{kitamura2023regularization} an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $ε$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-γ)^4ε^2}\right)$ samples. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-γ)^6ε^2ζ^2}\right)$ samples, where $ζ$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Furthermore, we prove a lower-bound of $Ω\left(\frac{d^2}{(1-γ)^5ε^2ζ^2}\right)$ for the strict feasibility setting. We note that our upper bounds under both settings exhibit a near-optimal dependence on $d$, $ε$, and $ζ$. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.