LG · IT · Dec 28, 2015

Taming the Noise in Reinforcement Learning via Soft Updates

arXiv:1512.08562v4 · 372 citations
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in model-free reinforcement learning, namely the biased value estimates that slow early learning in noisy environments, and offers an incremental improvement in convergence rate.

The paper tackles the problem of poor performance in model-free reinforcement learning in noisy environments by proposing G-learning, a new off-policy algorithm that regularizes value estimates to reduce bias, resulting in faster convergence to optimal policies and lower learning costs.

Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.
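The two ideas in the abstract, the selection bias of a hard optimum over noisy estimates and a soft update that dampens it, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes a cost-minimizing formulation, a tabular array `G[state, action]`, a prior policy `prior[state, action]`, and an inverse-temperature parameter `beta` that would be increased during learning so that near-deterministic policies are penalized only early on. All function names and parameters here are hypothetical.

```python
import numpy as np


def soft_min_backup(g_next, prior_next, beta):
    """Soft minimum of next-state values: -(1/beta) * log sum_a prior[a] * exp(-beta * g[a]).

    Small beta gives roughly the prior-weighted mean; large beta approaches the hard minimum.
    """
    z = -beta * np.asarray(g_next, dtype=float)
    z_max = z.max()  # shift for numerical stability of the log-sum-exp
    return -(z_max + np.log(np.sum(prior_next * np.exp(z - z_max)))) / beta


def g_learning_update(G, s, a, cost, s_next, prior, alpha, beta, gamma):
    """One tabular soft update (assumed form): blend the old value with a soft-min target."""
    target = cost + gamma * soft_min_backup(G[s_next], prior[s_next], beta)
    G[s, a] = (1.0 - alpha) * G[s, a] + alpha * target
    return G


def soft_policy(g_s, prior_s, beta):
    """Stochastic policy proportional to prior * exp(-beta * G) for one state."""
    logits = np.log(prior_s) - beta * np.asarray(g_s, dtype=float)
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()


def selection_bias_demo(n_actions=4, n_trials=10000, beta=0.1, seed=0):
    """Compare hard-min and soft-min backups when every true value is 0.

    The estimates are pure unit-variance noise, so any nonzero average is bias.
    """
    rng = np.random.default_rng(seed)
    prior = np.full(n_actions, 1.0 / n_actions)
    noisy = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
    hard = noisy.min(axis=1).mean()
    soft = np.mean([soft_min_backup(row, prior, beta) for row in noisy])
    return hard, soft
```

Calling `selection_bias_demo()` returns the average hard-min and soft-min backups. With a small `beta` the soft backup stays much closer to the true value of zero, which is the bias reduction the abstract describes. How quickly `beta` should grow is a design choice that this sketch leaves open.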

Code Implementations: 3 repos