Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning
This work addresses the sample efficiency of reinforcement learning agents in real-world applications with fixed time horizons, such as tutoring or customer service, by providing near-tight bounds that improve on previous looser results.
The paper tackles the problem of deriving tight sample complexity bounds for episodic fixed-horizon reinforcement learning, achieving an upper bound of ̃O(|S|^2|A|H^2/ε^2 ln(1/δ)) and a matching lower bound up to log-terms and a linear state dependency, with the lower bound being the first of its kind in this setting.
Recently, there has been significant progress in understanding reinforcement learning in discounted infinite-horizon Markov decision processes (MDPs) by deriving tight sample complexity bounds. However, in many real-world applications, an interactive learning agent operates for a fixed or bounded period of time, for example tutoring students for exams or handling customer service requests. Such scenarios can often be better treated as episodic fixed-horizon MDPs, for which only looser bounds on the sample complexity exist. A natural notion of sample complexity in this setting is the number of episodes required to guarantee a certain performance with high probability (PAC guarantee). In this paper, we derive an upper PAC bound $\tilde O(\frac{|\mathcal S|^2 |\mathcal A| H^2}{ε^2} \ln\frac 1 δ)$ and a lower PAC bound $\tilde Ω(\frac{|\mathcal S| |\mathcal A| H^2}{ε^2} \ln \frac 1 {δ+ c})$ that match up to log-terms and an additional linear dependency on the number of states $|\mathcal S|$. The lower bound is the first of its kind for this setting. Our upper bound leverages Bernstein's inequality to improve on previous bounds for episodic finite-horizon MDPs which have a time-horizon dependency of at least $H^3$.