LG MLJun 13, 2022

Near-Optimal Sample Complexity Bounds for Constrained MDPs

Sharan Vaswani, Lin F. Yang, Csaba Szepesvári

DeepMind

arXiv:2206.06270v221.946 citationsh-index: 77

Originality Highly original

AI Analysis

This resolves a fundamental open problem in reinforcement learning theory for researchers, providing optimal statistical bounds for CMDPs, which is incremental in extending MDP results to constrained settings.

The paper tackles the unknown sample complexity for solving constrained Markov decision processes (CMDPs) by providing minimax upper and lower bounds, showing that learning is as easy as unconstrained MDPs with small constraint violations but inherently harder with zero violation, with sample complexities of $ ilde{O}\left(rac{S A \log(1/\delta)}{(1 - \gamma)^3 \epsilon^2} ight)$ and $ ilde{O} \left(rac{S A \log(1/\delta)}{(1 - \gamma)^5 \epsilon^2 \zeta^2} ight)$ respectively.

In contrast to the advances in characterizing the sample complexity for solving Markov decision processes (MDPs), the optimal statistical complexity for solving constrained MDPs (CMDPs) remains unknown. We resolve this question by providing minimax upper and lower bounds on the sample complexity for learning near-optimal policies in a discounted CMDP with access to a generative model (simulator). In particular, we design a model-based algorithm that addresses two settings: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to satisfy the constraint. For (i), we prove that our algorithm returns an $ε$-optimal policy with probability $1 - δ$, by making $\tilde{O}\left(\frac{S A \log(1/δ)}{(1 - γ)^3 ε^2}\right)$ queries to the generative model, thus matching the sample-complexity for unconstrained MDPs. For (ii), we show that the algorithm's sample complexity is upper-bounded by $\tilde{O} \left(\frac{S A \, \log(1/δ)}{(1 - γ)^5 \, ε^2 ζ^2} \right)$ where $ζ$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we prove a matching lower-bound for the strict feasibility setting, thus obtaining the first near minimax optimal bounds for discounted CMDPs. Our results show that learning CMDPs is as easy as MDPs when small constraint violations are allowed, but inherently more difficult when we demand zero constraint violation.

View on arXiv PDF

Similar