Smoothed Q-learning
The work addresses a known bottleneck in reinforcement learning, namely value overestimation in Q-learning, which hurts both stability and efficiency. The contribution appears incremental, since it builds on existing Q-learning variants.
The paper tackles the overestimation problem in Q-learning by replacing the max operation with an average. The resulting off-policy algorithm is provably convergent and mitigates overestimation while retaining a convergence speed similar to that of standard Q-learning.
In reinforcement learning, the Q-learning algorithm provably converges to the optimal solution. However, as others have demonstrated, Q-learning can also overestimate the values and thereby spend too long exploring unhelpful states. Double Q-learning is a provably convergent alternative that mitigates some of the overestimation issues, though sometimes at the expense of slower convergence. We introduce an alternative algorithm that replaces the max operation with an average. The result is also a provably convergent off-policy algorithm, one that can mitigate overestimation yet retain convergence similar to that of standard Q-learning.
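To make the difference between the max and the average concrete, the following is a minimal tabular sketch, not the authors' implementation. The abstract does not say how the average is weighted, so the softmax weighting and its `temperature` parameter are assumptions made for illustration, as are all function names and the NumPy-based table layout.

```python
import numpy as np

def softmax_weights(q_row, temperature):
    """Assumed weighting (not from the source): softmax over one state's action values."""
    z = (q_row - np.max(q_row)) / temperature  # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def q_learning_target(q_next, reward, gamma):
    """Standard Q-learning target: bootstrap from the max over next-state action values."""
    return reward + gamma * np.max(q_next)

def smoothed_target(q_next, reward, gamma, temperature):
    """Target with the max replaced by a weighted average of next-state action values."""
    weights = softmax_weights(q_next, temperature)
    return reward + gamma * np.dot(weights, q_next)

def td_update(Q, s, a, target, alpha):
    """Move Q[s, a] a fraction alpha of the way toward the target (in place)."""
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

if __name__ == "__main__":
    q_next = np.array([1.0, 2.0, 3.0])  # hypothetical next-state action values
    print("max target:     ", q_learning_target(q_next, reward=0.0, gamma=0.9))
    print("smoothed target:", smoothed_target(q_next, reward=0.0, gamma=0.9, temperature=1.0))
```

Because a weighted average of the action values can never exceed their maximum, the smoothed target is never larger than the standard one for a non-negative discount, which is the intuition behind reduced overestimation. Under the assumed softmax weighting, lowering the temperature moves the smoothed target back toward the max.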