LG · Feb 7, 2021

Model-Augmented Q-learning

arXiv:2102.03866v1 · 1 citation
Originality: Highly original
AI Analysis

This work gives reinforcement learning practitioners a method that improves the stability and performance of Q-learning by mitigating value-estimation bias.

The paper addresses under- and overestimation bias in Q-learning by proposing Model-augmented Q-learning (MQL), a model-free reinforcement learning (MFRL) framework augmented with model-based components. MQL estimates Q-values, transitions, and rewards with a shared network and uses the estimated reward in the Q-learning update; the resulting solution is policy-invariant and identical to the one obtained by learning with the true reward. Built on state-of-the-art off-policy MFRL methods, MQL largely improves their performance and convergence.

In recent years, $Q$-learning has become indispensable for model-free reinforcement learning (MFRL). However, it suffers from well-known problems such as under- and overestimation bias of the value, which may adversely affect policy learning. To resolve this issue, we propose an MFRL framework that is augmented with the components of model-based RL. Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network. We further utilize the estimated reward from the model estimators for $Q$-learning, which promotes interaction between the estimators. We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution which is identical to the solution obtained by learning with the true reward. Finally, we also provide a trick to prioritize past experiences in the replay buffer by utilizing model-estimation errors. We experimentally validate MQL built upon state-of-the-art off-policy MFRL methods, and show that MQL largely improves their performance and convergence. The proposed scheme is simple to implement and does not require additional training cost.
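
The two mechanisms named in the abstract, using the estimated reward in the $Q$-learning target and prioritizing replay by model-estimation error, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`td_target`, `model_error`, `sampling_probs`), the max-over-actions bootstrap for discrete actions, the squared-error form, and the choice to sample high-error transitions more often are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def td_target(reward_hat: float, next_q_values: Sequence[float],
              gamma: float = 0.99, done: bool = False) -> float:
    """Bootstrapped Q-learning target built from the ESTIMATED reward.

    Assumption: discrete actions, so the bootstrap is a max over next Q-values.
    """
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward_hat + bootstrap


def model_error(reward_hat: float, reward: float,
                next_state_hat: Sequence[float],
                next_state: Sequence[float]) -> float:
    """Squared reward error plus squared transition error (assumed form)."""
    reward_err = (reward_hat - reward) ** 2
    transition_err = sum((p - t) ** 2 for p, t in zip(next_state_hat, next_state))
    return reward_err + transition_err


def sampling_probs(errors: Sequence[float], alpha: float = 1.0,
                   eps: float = 1e-6) -> List[float]:
    """Map per-transition model errors to replay sampling probabilities.

    Assumption: larger model error means higher sampling priority.
    """
    scores = [(e + eps) ** alpha for e in errors]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Hypothetical outputs of the reward and transition heads for two transitions.
    errors = [
        model_error(1.0, 0.5, [0.0, 1.0], [0.0, 0.0]),
        model_error(0.2, 0.2, [1.0, 1.0], [1.0, 1.0]),
    ]
    print(td_target(1.0, [0.5, 2.0], gamma=0.9))
    print(sampling_probs(errors))
```

In a full agent the reward estimate, next-state estimate, and $Q$-values would come from heads of one shared network trained jointly; here they are plain numbers so the target and priority computations can be read in isolation.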

