LGAISTMLOct 9, 2023

When is Agnostic Reinforcement Learning Statistically Tractable?

arXiv:2310.06113v17 citationsh-index: 73
Originality Highly original
AI Analysis

This work addresses fundamental statistical tractability in reinforcement learning for researchers and practitioners, revealing a surprising separation between generative and online access models, which is incremental in advancing theoretical understanding.

The paper tackles the problem of agnostic PAC reinforcement learning by introducing a new complexity measure called spanning capacity to characterize learnability, showing that bounded spanning capacity ensures efficient learning with a generative model but not necessarily in online settings, where a superpolynomial sample requirement exists for some policy classes.

We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $Π$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $ε$-suboptimal policy with respect to $Π$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $Π$ and is independent of the MDP dynamics. With a generative model, we show that for any policy class $Π$, bounded spanning capacity characterizes PAC learnability. However, for online RL, the situation is more subtle. We show there exists a policy class $Π$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn. This reveals a surprising separation for agnostic learnability between generative access and online access models (as well as between deterministic/stochastic MDPs under online access). On the positive side, we identify an additional \emph{sunflower} structure, which in conjunction with bounded spanning capacity enables statistically efficient online RL via a new algorithm called POPLER, which takes inspiration from classical importance sampling methods as well as techniques for reachable-state identification and policy evaluation in reward-free exploration.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes