Efficient exploration with Double Uncertain Value Networks
The paper addresses exploration challenges for reinforcement learning agents, particularly in complex environments, though the contribution appears incremental because it builds on existing uncertainty estimation methods.
The paper tackles the problem of directed exploration in reinforcement learning by tracking two sources of uncertainty about action values: parametric uncertainty, which stems from limited data, and return uncertainty, which stems from the distribution of returns. The proposed Double Uncertain Value Network estimates both jointly, and the experiments report improved learning in domains with strong exploration challenges.
This paper studies directed exploration for reinforcement learning agents by tracking uncertainty about the value of each available action. We identify two sources of uncertainty that are relevant for exploration. The first originates from limited data (parametric uncertainty), while the second originates from the distribution of the returns (return uncertainty). We describe methods to learn these distributions with deep neural networks: parametric uncertainty is estimated with Bayesian dropout, while return uncertainty is propagated through the Bellman equation as a Gaussian distribution. We then show that both can be jointly estimated in one network, which we call the Double Uncertain Value Network. The policy is derived directly from the learned distributions using Thompson sampling. Experimental results show that both types of uncertainty may vastly improve learning in domains with a strong exploration challenge.
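As a rough illustration of how the two uncertainty sources could combine under Thompson sampling, the sketch below draws one dropout mask (one sample of the parameters), reads off a Gaussian mean and standard deviation per action, and then samples a return for each action before acting greedily on the samples. It is a minimal sketch, not the paper's implementation: the tiny one-hidden-layer network, the weight names `W1`, `W_mu`, `W_sigma`, the `keep_prob` value, and all helper functions are hypothetical. The `bellman_target` helper additionally assumes a deterministic reward and transition, in which case a Gaussian next-state return stays Gaussian after the affine Bellman backup.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus; maps any real number to a non-negative std.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(state, params, rng, keep_prob=0.9):
    """One stochastic pass: returns per-action mean and std of the return."""
    W1, W_mu, W_sigma = params  # hypothetical weight names
    h = [max(0.0, z) for z in matvec(W1, state)]  # ReLU hidden layer
    # A fresh dropout mask acts as one sample of the network parameters
    # (parametric uncertainty).
    h = [z / keep_prob if rng.random() < keep_prob else 0.0 for z in h]
    mu = matvec(W_mu, h)
    sigma = [softplus(z) for z in matvec(W_sigma, h)]
    return mu, sigma


def thompson_action(state, params, rng, keep_prob=0.9):
    """Sample parameters, then sample a return per action, then act greedily."""
    mu, sigma = forward(state, params, rng, keep_prob)
    # Sampling from N(mu, sigma^2) accounts for return uncertainty.
    sampled = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    action = max(range(len(sampled)), key=sampled.__getitem__)
    return action, mu, sigma


def bellman_target(reward, gamma, mu_next, sigma_next):
    # Assumption: deterministic reward and transition, so
    # r + gamma * Z' with Z' ~ N(mu', sigma'^2) is N(r + gamma*mu', (gamma*sigma')^2).
    return reward + gamma * mu_next, gamma * sigma_next


def random_matrix(rows, cols, scale, rng):
    # Helper for the demo only: Gaussian-initialised weights.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
params = (
    random_matrix(8, 4, 1.0, rng),  # W1: 4 state features -> 8 hidden units
    random_matrix(3, 8, 0.1, rng),  # W_mu: 8 hidden units -> 3 action means
    random_matrix(3, 8, 0.1, rng),  # W_sigma: 8 hidden units -> 3 action stds
)
action, mu, sigma = thompson_action([1.0, 1.0, 1.0, 1.0], params, rng)
```

Calling `thompson_action` repeatedly on the same state can return different actions, because both the dropout mask and the sampled returns change between calls; that variation is what drives exploration in this sketch.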