$Q$-learning with Logarithmic Regret
This paper gives the first non-asymptotic proof that a model-free algorithm can achieve logarithmic regret, addressing a key theoretical challenge in reinforcement learning.
This paper studies episodic tabular reinforcement learning in which the optimal $Q$-function has a strictly positive sub-optimality gap, and proves that optimistic $Q$-learning achieves a logarithmic cumulative regret bound that matches the information-theoretic lower bound in $S$, $A$, and $T$ up to a $\log\left(SA\right)$ factor.
This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal $Q$-function. We prove that the optimistic $Q$-learning algorithm studied in [Jin et al. 2018] enjoys an ${\mathcal{O}}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S,A,T$ up to a $\log\left(SA\right)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
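The abstract does not spell out the algorithm, so the following is only a minimal sketch of tabular optimistic $Q$-learning with a Hoeffding-style exploration bonus, in the spirit of the method of [Jin et al. 2018]. The environment callbacks `step` and `reset`, the bonus constant `c`, and the failure probability `p` are assumptions introduced for illustration; they are not taken from the paper.

```python
import math


def optimistic_q_learning(step, reset, S, A, H, K, c=1.0, p=0.05):
    """Sketch of tabular optimistic Q-learning with a Hoeffding-style bonus.

    step(h, s, a) -> (reward in [0, 1], next_state) and reset() -> initial
    state are hypothetical environment callbacks supplied by the caller.
    c (bonus scale) and p (failure probability) are illustrative defaults.
    """
    T = K * H                        # total number of steps
    iota = math.log(S * A * T / p)   # logarithmic factor inside the bonus
    # Optimistic initialisation: every Q-value starts at the maximum return H.
    Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
    # V[H] is the terminal value and stays at zero.
    V = [[float(H)] * S for _ in range(H)] + [[0.0] * S]
    N = [[[0] * A for _ in range(S)] for _ in range(H)]  # visit counts
    total_reward = 0.0
    for _ in range(K):
        s = reset()
        for h in range(H):
            # Act greedily with respect to the optimistic Q-estimate.
            a = max(range(A), key=lambda b: Q[h][s][b])
            r, s_next = step(h, s, a)
            total_reward += r
            N[h][s][a] += 1
            t = N[h][s][a]
            alpha = (H + 1) / (H + t)                 # learning rate
            bonus = c * math.sqrt(H ** 3 * iota / t)  # exploration bonus
            target = r + V[h + 1][s_next] + bonus
            Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * target
            # Clip the value estimate at the largest achievable return.
            V[h][s] = min(float(H), max(Q[h][s]))
            s = s_next
    return Q, total_reward
```

On an instance with a positive gap, the bonus of a sub-optimal action shrinks below the gap after a bounded number of visits, after which the greedy rule stops selecting it. This is the intuition behind a regret that scales with $1/\Delta_{\min}$ and only logarithmically with $T$.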