Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
The paper addresses a fundamental bottleneck in reinforcement learning: it provides what the authors describe as the first minimax-optimal guarantees covering the entire range of sample sizes, a substantive advance in theoretical understanding rather than an incremental refinement.
This paper tackles sample efficiency in model-based reinforcement learning with a generative model. For γ-discounted infinite-horizon MDPs, it certifies the minimax optimality of two algorithms as soon as the sample size exceeds the order of |S||A|/(1-γ) (modulo log factors), whereas prior guarantees held only when the sample size exceeded at least |S||A|/(1-γ)^2. For time-inhomogeneous finite-horizon MDPs, it shows that a plain model-based planning algorithm achieves minimax-optimal sample complexity for any target accuracy level.
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
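To make the setting concrete, the following is a minimal Python sketch of the generic model-based pipeline with a generative model: draw the same number of next-state samples for every state-action pair, form the empirical transition kernel, and plan in the resulting empirical MDP. It is an illustration under stated assumptions, not the paper's certified procedure. It uses plain value iteration and omits both the perturbation and the conservative modification named in the abstract, whose details are not given here. The function names, the toy MDP, and the assumption of a known reward table `r` with entries in [0, 1] are hypothetical choices made for this example.

```python
import numpy as np


def sample_empirical_model(P, n_per_pair, rng):
    """Query the generative model n_per_pair times for every (s, a) and
    return the empirical transition kernel, shape (S, A, S)."""
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_per_pair, p=P[s, a])
            P_hat[s, a] = np.bincount(next_states, minlength=S) / n_per_pair
    return P_hat


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Plan in the MDP (P, r); return (value estimate, greedy policy)."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new = (r + gamma * (P @ V)).max(axis=1)  # Bellman optimality update
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = (r + gamma * (P @ V)).argmax(axis=1)
    return V, policy


def policy_value(P, r, gamma, policy):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) V = r_pi."""
    idx = np.arange(P.shape[0])
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P[idx, policy], r[idx, policy])


def sample_size_thresholds(S, A, gamma):
    """Orders (log factors ignored) of the two thresholds quoted in the abstract."""
    prior_barrier = S * A / (1.0 - gamma) ** 2
    new_threshold = S * A / (1.0 - gamma)
    return prior_barrier, new_threshold


# Toy example: a random MDP with 5 states, 3 actions, and gamma = 0.9 (all hypothetical).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # each P_true[s, a] is a distribution
r = rng.uniform(size=(S, A))                     # rewards assumed known, in [0, 1]

P_hat = sample_empirical_model(P_true, n_per_pair=50, rng=rng)
_, pi_hat = value_iteration(P_hat, r, gamma)     # plan in the empirical MDP
V_star, _ = value_iteration(P_true, r, gamma)    # reference: plan in the true MDP
gap = V_star - policy_value(P_true, r, gamma, pi_hat)

print("thresholds (prior, new):", sample_size_thresholds(S, A, gamma))
print("total samples used:", S * A * 50)
print("max suboptimality of learned policy:", gap.max())
```

The total sample size here is the number of state-action pairs times `n_per_pair`, so the threshold of order $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ corresponds to roughly $\frac{1}{1-\gamma}$ samples per pair, up to log factors, against roughly $\frac{1}{(1-\gamma)^2}$ per pair for the earlier barrier.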