LGDSOCJun 13, 2021

Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

arXiv:2106.07046v145 citations
Originality Incremental advance
AI Analysis

This work provides theoretical bounds for sample efficiency in reinforcement learning, addressing a fundamental challenge for researchers and practitioners in AI and control systems, though it is incremental in refining existing complexity analyses.

The paper tackles the problem of determining the sample complexity for finding an ε-optimal policy in infinite-horizon average-reward Markov decision processes (MDPs) with a generative model, proving an upper bound of Õ(t_mix ε^{-3}) samples per state-action pair and a lower bound showing linear dependence on mixing time is necessary.

We prove new upper and lower bounds for sample complexity of finding an $ε$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} ε^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes