Towards Tight Bounds on the Sample Complexity of Average-reward MDPs
This work provides theoretical bounds for sample efficiency in reinforcement learning, addressing a fundamental challenge for researchers and practitioners in AI and control systems, though it is incremental in refining existing complexity analyses.
The paper tackles the problem of determining the sample complexity for finding an ε-optimal policy in infinite-horizon average-reward Markov decision processes (MDPs) with a generative model, proving an upper bound of Õ(t_mix ε^{-3}) samples per state-action pair and a lower bound showing linear dependence on mixing time is necessary.
We prove new upper and lower bounds for sample complexity of finding an $ε$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} ε^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.