PAC Guarantees for Reinforcement Learning: Sample Complexity, Coverage, and Structure
This work addresses practitioners' need for reliable performance guarantees in RL when data is scarce or mistakes are costly. Its contribution is incremental: it surveys existing progress and offers an interpretive framework rather than new theoretical results.
The paper surveys progress from 2018 to 2025 on probably approximately correct (PAC) guarantees for reinforcement learning and introduces the Coverage-Structure-Objective (CSO) framework, which decomposes sample complexity results to identify bottlenecks and enable cross-setting comparisons.
When data is scarce or mistakes are costly, average-case metrics fall short. What a practitioner needs is a guarantee: with probability at least $1-\delta$, the learned policy is $\varepsilon$-close to optimal after $N$ episodes. This is the PAC promise, and between 2018 and 2025 the RL theory community made striking progress on when such promises can be kept. We survey that progress. Our organizing tool is the Coverage-Structure-Objective (CSO) framework, proposed here, which decomposes nearly every PAC sample complexity result into three factors: coverage (how data were obtained), structure (intrinsic MDP or function-class complexity), and objective (what the learner must deliver). CSO is not a theorem but an interpretive template that identifies bottlenecks and makes cross-setting comparison immediate. The technical core covers tight tabular baselines and the uniform-PAC bridge to regret; structural complexity measures (Bellman rank, witness rank, Bellman-Eluder dimension) governing learnability with function approximation; results for linear, kernel/NTK, and low-rank models; reward-free exploration as upfront coverage investment; and pessimistic offline RL where inherited coverage is the binding constraint. We provide practitioner tools: rate lookup tables indexed by CSO coordinates, Bellman residual diagnostics, coverage estimation with deployment gates, and per-episode policy certificates. A final section catalogs open problems, separating near-term targets from frontier questions where coverage, structure, and computation tangle in ways current theory cannot resolve.
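One common way to write the PAC promise stated above is sketched below. The symbols $V^{\star}$ (the optimal value from the initial state distribution), $\hat{\pi}_N$ (the policy returned after $N$ episodes), and $V^{\hat{\pi}_N}$ (its value) are notation assumed here for illustration and are not taken from the text; individual results in the survey may use different conventions.

```latex
% Assumed notation: V^\star is the optimal value, \hat{\pi}_N the policy
% returned after N episodes, and V^{\hat{\pi}_N} the value of that policy.
\[
  \Pr\!\left[\, V^{\star} - V^{\hat{\pi}_N} \le \varepsilon \,\right] \;\ge\; 1 - \delta .
\]
```

Under this reading, a sample complexity bound gives a number of episodes $N$, as a function of $\varepsilon$, $\delta$, and the problem's coverage and structure parameters, that suffices for the inequality to hold.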