LG OC MLOct 13, 2023

Optimal Sample Complexity for Average Reward Markov Decision Processes

Shengbo Wang, Jose Blanchet, Peter Glynn

Stanford

arXiv:2310.08833v213.017 citationsh-index: 6

Originality Highly original

AI Analysis

This work provides an optimal sample complexity result for reinforcement learning researchers and practitioners dealing with average reward MDPs, addressing a specific theoretical bottleneck.

The paper resolves the open question of sample complexity for policy learning in average reward Markov decision processes by developing an estimator that achieves the known lower bound of Ω(|S||A|t_mix ε^{-2}), matching it with an upper bound of ̃O(|S||A|t_mix ε^{-2}), thus bridging a gap of t_mix from prior results.

We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 ε^{-2})$ and a lower bound of $Ω(|S||A|t_{\text{mix}} ε^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\text{mix}}$ serves as a uniform upper limit for the total variation mixing times, and $ε$ signifies the error tolerance. Therefore, a notable gap of $t_{\text{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde O(|S||A|t_{\text{mix}}ε^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin and Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.

View on arXiv PDF

Similar