Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning
This work addresses convergence issues in reinforcement learning for practitioners, though it is incremental as it builds on existing SOR methods.
The paper tackles slow convergence in Q-learning by proposing a sample-based, model-free double SOR Q-learning algorithm, which reduces bias and is extended to deep RL, showing improved performance in tabular and large-scale environments.
Q-learning is a widely used algorithm in reinforcement learning (RL), but its convergence can be slow, especially when the discount factor is close to one. Successive Over-Relaxation (SOR) Q-learning, which introduces a relaxation factor to speed up convergence, addresses this issue but has two major limitations: In the tabular setting, the relaxation parameter depends on transition probability, making it not entirely model-free, and it suffers from overestimation bias. To overcome these limitations, we propose a sample-based, model-free double SOR Q-learning algorithm. Theoretically and empirically, this algorithm is shown to be less biased than SOR Q-learning. Further, in the tabular setting, the convergence analysis under boundedness assumptions on iterates is discussed. The proposed algorithm is extended to large-scale problems using deep RL. Finally, the tabular version of the proposed algorithm is compared using roulette and grid world environments, while the deep RL version is tested on a maximization bias example and OpenAI Gym environments.