Tail Distribution of Regret in Optimistic Reinforcement Learning
This work addresses the need for more detailed regret analysis in reinforcement learning, offering insights into rare but large regret events, which is incremental but important for theoretical understanding and algorithm tuning.
The paper tackles the problem of characterizing the full tail distribution of regret in optimistic reinforcement learning for finite-horizon tabular Markov decision processes, deriving instance-dependent bounds that show a two-regime structure with sub-Gaussian and sub-Weibull tails, and providing corresponding expected regret bounds.
We derive instance-dependent tail bounds for the regret of optimism-based reinforcement learning in finite-horizon tabular Markov decision processes with unknown transition dynamics. Focusing on a UCBVI-type algorithm, we characterize the tail distribution of the cumulative regret $R_K$ over $K$ episodes, rather than only its expectation or a single high-probability quantile. We analyze two natural exploration-bonus schedules: (i) a $K$-dependent scheme that explicitly incorporates the total number of episodes $K$, and (ii) a $K$-independent scheme that depends only on the current episode index. For both settings, we obtain an upper bound on $\Pr(R_K \ge x)$ that exhibits a distinctive two-regime structure: a sub-Gaussian tail starting from an instance-dependent scale $m_K$ up to a transition threshold, followed by a sub-Weibull tail beyond that point. We further derive corresponding instance-dependent bounds on the expected regret $\mathbb{E}[R_K]$. The proposed algorithm depends on a tuning parameter $α$, which balances the expected regret and the range over which the regret exhibits a sub-Gaussian tail. To the best of our knowledge, our results provide one of the first comprehensive tail-regret guarantees for a standard optimistic algorithm in episodic reinforcement learning.