OC · LG · May 3, 2024

Regularized Q-learning through Robust Averaging

arXiv:2405.02201v2 · ICML
AI Analysis

The paper addresses a specific weakness of Q-learning that matters to practitioners, but its contribution is incremental because it builds on existing Q-learning frameworks.

The paper tackles estimation bias in Q-learning by proposing 2RA Q-learning, which uses a distributionally robust estimator to control that bias. It shows that, in the tabular case, the method converges to the optimal policy, and reports experiments in which it often outperforms existing methods.

We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We propose a distributionally robust estimator for the maximum expected value term, which allows us to precisely control the level of estimation bias introduced. The distributionally robust estimator admits a closed-form solution such that the proposed algorithm has a computational cost per iteration comparable to Watkins' Q-learning. For the tabular case, we show that 2RA Q-learning converges to the optimal policy and analyze its asymptotic mean-squared error. Lastly, we conduct numerical experiments for various settings, which corroborate our theoretical findings and indicate that 2RA Q-learning often performs better than existing methods.
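
The estimation bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the 2RA estimator and is not taken from the paper. It only shows the well-known overestimation that arises when the maximum expected value is estimated by the maximum of noisy sample means, as the max term in Watkins' Q-learning does. The function names, the Gaussian reward model, and all parameter values are assumptions chosen for illustration.

```python
import random
import statistics


def max_of_sample_means(n_actions, n_samples, rng):
    """Estimate max_a E[X_a] by the max of per-action sample means.

    Assumption for illustration: every action has standard normal rewards,
    so each true mean is 0 and the true maximum expected value is 0.
    """
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_samples))
        for _ in range(n_actions)
    ]
    return max(means)


def average_bias(n_actions=10, n_samples=5, n_trials=2000, seed=0):
    """Average the max-of-means estimate over many independent trials.

    Because the true value is 0, the returned average is the estimation bias.
    """
    rng = random.Random(seed)
    estimates = [
        max_of_sample_means(n_actions, n_samples, rng) for _ in range(n_trials)
    ]
    return statistics.fmean(estimates)


if __name__ == "__main__":
    print(f"bias with 10 actions: {average_bias(n_actions=10):.3f}")
    print(f"bias with 1 action:   {average_bias(n_actions=1):.3f}")
```

With a single action the estimate is an ordinary sample mean and the bias is close to zero. With several actions the maximum picks out whichever mean happened to be estimated too high, so the average estimate is clearly positive even though every true value is zero. This is the kind of uncontrolled bias the abstract says the proposed estimator is designed to regulate.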

Code Implementations: 1 repo
