LG MLMar 22, 2022

A Note on Target Q-learning For Solving Finite MDPs with A Generative Oracle

arXiv:2203.11489v17.86 citationsh-index: 45

Originality Synthesis-oriented

AI Analysis

This work provides a theoretical correction and analysis for reinforcement learning researchers, addressing incremental improvements in sample complexity bounds for target Q-learning.

The paper corrects a misleading claim about the sample complexity of target Q-learning in finite Markov Decision Processes with a generative oracle, establishing tight bounds that show it does not sacrifice sample complexity compared to vanilla Q-learning, with improvements under specific conditions.

Q-learning with function approximation could diverge in the off-policy setting and the target network is a powerful technique to address this issue. In this manuscript, we examine the sample complexity of the associated target Q-learning algorithm in the tabular case with a generative oracle. We point out a misleading claim in [Lee and He, 2020] and establish a tight analysis. In particular, we demonstrate that the sample complexity of the target Q-learning algorithm in [Lee and He, 2020] is $\widetilde{\mathcal O}(|\mathcal S|^2|\mathcal A|^2 (1-γ)^{-5}\varepsilon^{-2})$. Furthermore, we show that this sample complexity is improved to $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-γ)^{-5}\varepsilon^{-2})$ if we can sequentially update all state-action pairs and $\widetilde{\mathcal O}(|\mathcal S||\mathcal A| (1-γ)^{-4}\varepsilon^{-2})$ if $γ$ is further in $(1/2, 1)$. Compared with the vanilla Q-learning, our results conclude that the introduction of a periodically-frozen target Q-function does not sacrifice the sample complexity.

View on arXiv PDF

Similar