Factors of Influence of the Overestimation Bias of Q-Learning
This work addresses the overestimation bias problem in Q-Learning for reinforcement learning practitioners, though it appears incremental as it builds on existing methods by tuning parameters.
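The study investigates how the learning rate, discount factor, and reward signal influence the overestimation bias of Q-Learning. Preliminary results indicate that all three significantly affect it, and that carefully tuning the learning rate and discount factor, combined with an exponential moving average of the reward in the temporal difference target, can yield more accurate value estimates than several other popular model-free methods.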
We study whether the learning rate $α$, the discount factor $γ$, and the reward signal $r$ influence the overestimation bias of the Q-Learning algorithm. Our preliminary results, obtained in stochastic environments that require neural networks as function approximators, show that all three parameters influence overestimation significantly. By carefully tuning $α$ and $γ$, and by using an exponential moving average of $r$ in Q-Learning's temporal difference target, we show that the algorithm can learn value estimates that are more accurate than those of several other popular model-free methods that have previously addressed its overestimation bias.
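To make the role of the three parameters concrete, here is a minimal tabular sketch of a Q-Learning update whose temporal difference target uses an exponential moving average of the reward. It is an illustration under stated assumptions, not the authors' implementation: the abstract refers to neural-network function approximators, whereas this sketch uses a table; the smoothing coefficient `beta`, the choice to substitute the averaged reward for $r$ entirely, and the function names `ema_update` and `q_learning_step` are all hypothetical.

```python
import numpy as np


def ema_update(r_bar, r, beta):
    """Exponential moving average of the reward signal.

    `beta` is a hypothetical smoothing coefficient in (0, 1]; beta = 1
    recovers the raw reward r.
    """
    return (1.0 - beta) * r_bar + beta * r


def q_learning_step(Q, s, a, r, s_next, done, alpha, gamma, r_bar, beta):
    """One tabular Q-Learning update with an averaged reward in the target.

    Assumption: the averaged reward r_bar replaces r in the TD target.
    The abstract does not specify the exact form, so this is one reading.
    Q is modified in place; the updated running average is returned.
    """
    r_bar = ema_update(r_bar, r, beta)
    # The max over next-state values is the usual source of overestimation.
    bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
    target = r_bar + bootstrap
    Q[s, a] += alpha * (target - Q[s, a])
    return r_bar


# Usage: two states, two actions, all values initialised to zero.
Q = np.zeros((2, 2))
r_bar = q_learning_step(Q, s=0, a=1, r=1.0, s_next=1, done=False,
                        alpha=0.5, gamma=0.9, r_bar=0.0, beta=0.1)
```

In this sketch, $α$ scales the step toward the target, $γ$ scales the bootstrapped maximum, and the averaged reward smooths noise in $r$ before it enters the target, which are the three quantities the abstract identifies as influencing overestimation.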