LG · AI · Oct 28, 2021

Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates

arXiv:2110.14818v1 · 4 citations
Originality: Incremental advance
AI Analysis

This work addresses bias reduction in reinforcement learning for control tasks, but it is incremental as it builds on existing soft Q-learning methods by extending them to more complex settings.

The paper tackles bias in Temporal-Difference learning methods such as Q-Learning, which overestimate Q values because of estimation noise. It introduces Unbiased Soft Q-Learning (UQL), which extends prior work to multi-action, infinite-state-space settings and schedules the inverse temperature parameter in a principled way using model uncertainty. Experiments on discrete control environments show the method's effectiveness.

Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes a bias after the max operator in the policy improvement step, and carries over to value estimations of other states, causing Q-Learning to overestimate the Q value. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces the estimation bias via soft updates in early stages of training. However, the inverse temperature $β$ that controls the softness of an update is usually set by a hand-designed heuristic, which can be inaccurate at capturing the uncertainty in the target estimate. Under the belief that $β$ is closely related to the (state-dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled scheduling of $β$ by maintaining a collection of the model parameters that characterizes model uncertainty. In this paper, we present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state-space Markov Decision Processes. We also provide a principled numerical scheduling of $β$, extended from SQL and using model uncertainty, during the optimization process. We show the theoretical guarantees and the effectiveness of this update method in experiments on several discrete control environments.
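The overestimation effect described in the abstract can be illustrated with a small numerical sketch. This is not the paper's UQL algorithm. It assumes a Boltzmann-weighted average as a stand-in for the soft backup, a toy setting where every action has true value 0, and Gaussian estimation noise. The function names (`hard_max_target`, `soft_target`, `beta_from_ensemble`) and the ensemble-based choice of $β$ are hypothetical and are used only to show how uncertainty might map to softness.

```python
import numpy as np


def hard_max_target(q_estimates):
    """Greedy backup: maximum over the action axis."""
    return np.max(q_estimates, axis=-1)


def soft_target(q_estimates, beta):
    """Boltzmann-weighted backup: sum_a softmax(beta * q)_a * q_a.

    Assumed illustrative form of a soft update; beta -> 0 gives the mean
    over actions, beta -> infinity recovers the hard max.
    """
    # Subtract the row maximum before exponentiating for numerical stability.
    z = beta * (q_estimates - q_estimates.max(axis=-1, keepdims=True))
    w = np.exp(z)
    w /= w.sum(axis=-1, keepdims=True)
    return (w * q_estimates).sum(axis=-1)


def beta_from_ensemble(q_ensemble, scale=1.0, eps=1e-8):
    """Hypothetical heuristic, not the paper's schedule.

    q_ensemble has shape (n_models, n_actions). More disagreement between
    ensemble members yields a smaller beta, i.e. a softer update.
    """
    spread = q_ensemble.std(axis=0).mean()
    return scale / (spread + eps)


rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 4, 100_000, 1.0

# Every action is truly worth 0, so the true max over actions is also 0.
true_q = np.zeros(n_actions)
noisy_q = true_q + noise_std * rng.standard_normal((n_trials, n_actions))

# Average target over many noisy draws; any nonzero value is pure bias.
bias_hard = hard_max_target(noisy_q).mean()
bias_soft_low_beta = soft_target(noisy_q, beta=0.1).mean()
bias_soft_high_beta = soft_target(noisy_q, beta=10.0).mean()

print(f"hard max bias:         {bias_hard:.3f}")
print(f"soft bias (beta=0.1):  {bias_soft_low_beta:.3f}")
print(f"soft bias (beta=10.0): {bias_soft_high_beta:.3f}")

# Ensembles that disagree more get a smaller (softer) beta under the heuristic.
tight_ensemble = 0.01 * rng.standard_normal((5, n_actions))
wide_ensemble = 1.0 * rng.standard_normal((5, n_actions))
beta_tight = beta_from_ensemble(tight_ensemble)
beta_wide = beta_from_ensemble(wide_ensemble)
print(f"beta (tight ensemble): {beta_tight:.2f}")
print(f"beta (wide ensemble):  {beta_wide:.2f}")
```

In this toy setting the hard max turns zero-mean noise into a clearly positive average target, while a small $β$ keeps the average target close to the true value of 0. A large $β$ approaches the hard max and inherits most of its bias, which is why the choice of $β$ relative to the uncertainty in the estimates matters.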
