Finite-Time Analysis for Double Q-learning
This work addresses a theoretical gap for researchers and practitioners in reinforcement learning by offering rigorous convergence guarantees for a widely used algorithm, though it is incremental in that it extends existing asymptotic results.
The paper tackles the limited theoretical understanding of double Q-learning by providing the first finite-time analysis, showing that both the synchronous and asynchronous versions converge to an ε-accurate neighborhood of the optimum within a number of iterations that depends on the accuracy ε, the discount factor, and the decay parameter of the learning rate.
Although Q-learning is one of the most successful algorithms for finding the best action-value function (and thus the optimal policy) in reinforcement learning, its implementation often suffers from large overestimation of Q-function values incurred by random sampling. The double Q-learning algorithm proposed in~\citet{hasselt2010double} overcomes such an overestimation issue by randomly switching the update between two Q-estimators, and has thus gained significant popularity in practice. However, the theoretical understanding of double Q-learning is rather limited. So far only the asymptotic convergence has been established, which does not characterize how fast the algorithm converges. In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum by taking $\tilde{\Omega}\left(\left( \frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}} +\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where $\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is the discount factor. Our analysis develops novel techniques to derive finite-time bounds on the difference between two inter-connected stochastic processes, which is new to the literature of stochastic approximation.
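
To make the update rule concrete, the following is a minimal sketch of synchronous double Q-learning on a small tabular MDP. It is an illustration under stated assumptions rather than the exact procedure analyzed in the paper: the polynomial learning rate $\alpha_t = 1/t^{\omega}$, the single coin flip per iteration, and the names `double_q_learning_sync`, `P`, and `R` are all assumptions introduced here. The key step is that the greedy action is selected with the estimator being updated while its value is read from the other estimator, which is what decouples action selection from value evaluation.

```python
import random


def double_q_learning_sync(P, R, gamma, omega, num_iters, seed=0):
    """Sketch of synchronous double Q-learning (hypothetical helper).

    P[s][a] is a list of next-state probabilities, R[s][a] a deterministic
    reward. Returns the two estimators (q_a, q_b) as nested lists.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    q_a = [[0.0] * n_actions for _ in range(n_states)]
    q_b = [[0.0] * n_actions for _ in range(n_states)]

    for t in range(1, num_iters + 1):
        # Assumed polynomial step size with decay parameter omega in (0, 1).
        alpha = 1.0 / (t ** omega)

        # Randomly choose which estimator is updated in this iteration.
        if rng.random() < 0.5:
            q_upd, q_eval = q_a, q_b
        else:
            q_upd, q_eval = q_b, q_a

        # Synchronous: every (s, a) pair is updated from the old values.
        new_vals = [row[:] for row in q_upd]
        for s in range(n_states):
            for a in range(n_actions):
                # Draw one next state from the generative model.
                s_next = rng.choices(range(n_states), weights=P[s][a])[0]
                # Select the greedy action with the estimator being updated...
                a_star = max(range(n_actions), key=lambda b: q_upd[s_next][b])
                # ...but evaluate that action with the other estimator.
                target = R[s][a] + gamma * q_eval[s_next][a_star]
                new_vals[s][a] = (1.0 - alpha) * q_upd[s][a] + alpha * target
        for s in range(n_states):
            q_upd[s][:] = new_vals[s]

    return q_a, q_b


if __name__ == "__main__":
    # Toy two-state MDP: action 0 moves to state 0 (reward 0),
    # action 1 moves to state 1 (reward 1).
    P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    R = [[0.0, 1.0], [0.0, 1.0]]
    q_a, q_b = double_q_learning_sync(P, R, gamma=0.5, omega=0.6, num_iters=2000)
    print(q_a)
    print(q_b)
```

In this toy MDP the optimal action-value function can be computed by hand ($Q^*(s,0)=1$ and $Q^*(s,1)=2$ for $\gamma=0.5$), so both estimators can be checked against it. An asynchronous variant would instead update only the state-action pair visited along a single trajectory.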