LGMLAug 5, 2021

Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

arXiv:2108.02717v246 citations
AI Analysis

This addresses a core theoretical problem in reinforcement learning for researchers, providing a novel approach to policy identification with instance-dependent guarantees, though it is incremental in advancing existing theory.

The paper tackles the problem of identifying ε-optimal policies in reinforcement learning, showing a fundamental tradeoff between low regret and instance-optimal rates, and proposes a new algorithm that achieves instance-dependent sample complexity with significant improvements over worst-case bounds, scaling with suboptimality gaps and state reachability.

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $ε$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $ε$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $ε$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes