AILGOCMLSep 15, 2017

The Uncertainty Bellman Equation and Exploration

arXiv:1709.05380v4213 citations
Originality Highly original
AI Analysis

This addresses the exploration challenge in reinforcement learning for agents in complex environments, offering a scalable method with significant performance gains, though it builds on existing Bellman equation concepts.

The paper tackles the exploration/exploitation problem in reinforcement learning by introducing an uncertainty Bellman equation (UBE) that extends exploratory benefits across time-steps, proving it provides a tighter upper bound on Q-value variance than traditional methods. Substituting UBE-exploration for ε-greedy improved DQN performance on 51 out of 57 Atari games.

We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar \textit{uncertainty} Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for $ε$-greedy improves DQN performance on 51 out of 57 games in the Atari suite.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes