LG · AI · DS · OC · ML · Nov 1, 2021

Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

arXiv:2111.00633v1 · 22 citations
Originality: Highly original
AI Analysis

This resolves a fundamental theoretical question in RL: how sample complexity scales with the horizon length. The contribution is theoretical in scope, but important for understanding algorithm efficiency.

The paper tackles horizon-dependence in reinforcement learning sample complexity. For a fixed number of states and actions, it develops an algorithm that learns an $O(1)$-optimal policy using only $O(1)$ episodes of environment interactions, showing that the $\mathrm{polylog}(H)$ dependence of earlier algorithms is not necessary.

Recently there has been a surge of interest in understanding the horizon-dependence of the sample complexity in reinforcement learning (RL). Notably, for an RL environment with horizon length $H$, previous work has shown that there is a probably approximately correct (PAC) algorithm that learns an $O(1)$-optimal policy using $\mathrm{polylog}(H)$ episodes of environment interactions when the number of states and actions is fixed. It is not yet known whether the $\mathrm{polylog}(H)$ dependence is necessary. In this work, we resolve this question by developing an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions, completely settling the horizon-dependence of the sample complexity in RL. We achieve this bound by (i) establishing a connection between value functions in discounted and finite-horizon Markov decision processes (MDPs) and (ii) a novel perturbation analysis in MDPs. We believe our new techniques are of independent interest and could be applied to related questions in RL.
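
To make the two objects named in (i) concrete, here is a minimal sketch that computes optimal finite-horizon values by backward induction and optimal discounted values by value iteration on a small tabular MDP. It is not the paper's algorithm or analysis. The MDP (`P`, `R`), the helper names, and the pairing $\gamma = 1 - 1/H$ used for the comparison are all assumptions made for illustration.

```python
# Hypothetical 2-state, 2-action tabular MDP, made up for illustration.
# P[s][a] lists next-state probabilities; R[s][a] is a reward in [0, 1].
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [0.0, 0.1],
    [0.5, 1.0],
]


def bellman_backup(P, R, V, gamma=1.0):
    """One optimal Bellman backup: max over actions of reward plus
    gamma times the expected next-state value."""
    return [
        max(
            R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]


def finite_horizon_values(P, R, H):
    """Backward induction: optimal expected total reward over H steps."""
    V = [0.0] * len(P)
    for _ in range(H):
        V = bellman_backup(P, R, V)
    return V


def discounted_values(P, R, gamma, tol=1e-10):
    """Value iteration for the optimal discounted infinite-horizon values."""
    V = [0.0] * len(P)
    while True:
        V_new = bellman_backup(P, R, V, gamma)
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


if __name__ == "__main__":
    # Compare the two value functions on a common [0, 1] scale.
    # The choice gamma = 1 - 1/H is an illustrative assumption.
    for H in (10, 100):
        gamma = 1.0 - 1.0 / H
        fh = [v / H for v in finite_horizon_values(P, R, H)]
        disc = [(1.0 - gamma) * v for v in discounted_values(P, R, gamma)]
        print(H, fh, disc)
```

Dividing the finite-horizon values by $H$ and multiplying the discounted values by $1 - \gamma$ puts both on the same scale, which makes it easy to see how closely they track each other as $H$ grows in this toy example.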

