LG · OC · ML · Feb 23, 2020

Periodic Q-Learning

arXiv:2002.09795v1 · 17 citations
AI Analysis

This paper provides a preliminary theoretical justification for target networks in Q-learning, a technique used to stabilize training in deep reinforcement learning. The contribution is incremental, since the analysis is restricted to a simplified setting.

The paper tackles the limited theoretical understanding of target networks in deep reinforcement learning by analyzing periodic Q-learning (PQ-learning) in the tabular setting, showing that it achieves better sample complexity than standard Q-learning for finding an epsilon-optimal policy.

The use of target networks is a common practice in deep reinforcement learning for stabilizing the training; however, theoretical understanding of this technique is still limited. In this paper, we study the so-called periodic Q-learning algorithm (PQ-learning for short), which resembles the technique used in deep Q-learning for solving infinite-horizon discounted Markov decision processes (DMDP) in the tabular setting. PQ-learning maintains two separate Q-value estimates - the online estimate and target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated periodically. In contrast to the standard Q-learning, PQ-learning enjoys a simple finite time analysis and achieves better sample complexity for finding an epsilon-optimal policy. Our result provides a preliminary justification of the effectiveness of utilizing target estimates or networks in Q-learning algorithms.
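
The two-estimate scheme described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the abstract does not spell out the update, so the code assumes (by analogy with deep Q-learning) that the online estimate bootstraps from the target estimate, that state-action pairs are sampled uniformly at random, and that the step size is constant. The function name `pq_learning`, its parameters, and the two-state toy MDP are all hypothetical.

```python
import random


def pq_learning(next_state, reward, gamma=0.9, alpha=0.5,
                num_periods=200, period_length=100, seed=0):
    """Tabular sketch of periodic Q-learning on a deterministic MDP.

    Assumptions (not taken from the paper): uniform sampling of (s, a),
    constant step size alpha, bootstrapping from the target estimate.
    """
    rng = random.Random(seed)
    n_states = len(next_state)
    n_actions = len(next_state[0])
    online = [[0.0] * n_actions for _ in range(n_states)]
    target = [row[:] for row in online]

    for _period in range(num_periods):
        for _step in range(period_length):
            # Sample a state-action pair and observe the transition.
            s = rng.randrange(n_states)
            a = rng.randrange(n_actions)
            s_next = next_state[s][a]
            r = reward[s][a]
            # Online estimate: Q-learning style step, with the bootstrap
            # value read from the (frozen) target estimate.
            td_target = r + gamma * max(target[s_next])
            online[s][a] += alpha * (td_target - online[s][a])
        # Target estimate: refreshed only once per period.
        target = [row[:] for row in online]
    return online


# Hypothetical toy MDP: action a moves to state a; only staying in
# state 1 (s=1, a=1) pays a reward of 1.
NEXT_STATE = [[0, 1], [0, 1]]
REWARD = [[0.0, 0.0], [0.0, 1.0]]

q = pq_learning(NEXT_STATE, REWARD)
greedy = [max(range(len(row)), key=row.__getitem__) for row in q]
print(q)
print(greedy)
```

Because the target is frozen within each period, the inner loop behaves like a simple regression toward a fixed Bellman backup, which is the structural feature the abstract credits for the simpler finite-time analysis.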

