LG | AI | Nov 22, 2023

Probabilistic Inference in Reinforcement Learning Done Right

arXiv:2311.13294v1 | 13 citations | h-index: 38
Originality: Incremental advance
AI Analysis

This work addresses a foundational issue in reinforcement learning for researchers and practitioners by improving inference accuracy and exploration efficiency, though it builds incrementally on existing methods like Thompson sampling.

The paper tackles the problem of poor approximations in probabilistic inference for reinforcement learning by providing a rigorous Bayesian treatment of state-action optimality, deriving a tractable variational approximation called VAPOR that ensures efficient exploration and demonstrates performance advantages in experiments.

A popular perspective in reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximating this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well in challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first reveal that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation yielding a tractable convex optimization problem and establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with some experiments demonstrating the performance advantage of a deep RL version of VAPOR.
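To make the "posterior probability of optimality" concrete, here is a minimal sketch in the simplest possible setting: a Bernoulli bandit, which can be viewed as an MDP with a single state and a horizon of one. This is an illustration only and is not the VAPOR algorithm from the paper. The function names, the uniform Beta(1, 1) prior, and the example counts are all assumptions made for this sketch. It estimates, by Monte Carlo, the posterior probability that each arm is the optimal one. Thompson sampling picks each arm with exactly this probability, since it acts greedily with respect to a single posterior draw.

```python
import random


def posterior_optimality_probs(successes, failures, n_samples=20000, seed=0):
    """Monte Carlo estimate of P(arm is optimal | data) for a Bernoulli bandit.

    Assumes (for illustration) an independent Beta(1, 1) prior on each arm's
    mean reward, so the posterior of arm i is Beta(1 + successes[i], 1 + failures[i]).
    """
    rng = random.Random(seed)
    n_arms = len(successes)
    wins = [0] * n_arms
    for _ in range(n_samples):
        # One joint draw of all arm means from the posterior.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        # Record which arm is optimal under this sampled environment.
        best = max(range(n_arms), key=draws.__getitem__)
        wins[best] += 1
    return [w / n_samples for w in wins]


def thompson_action(successes, failures, rng):
    """Thompson sampling: act greedily with respect to one posterior draw."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


# Hypothetical observation counts for three arms.
successes = [8, 3, 1]
failures = [2, 3, 1]
probs = posterior_optimality_probs(successes, failures)
action = thompson_action(successes, failures, random.Random(1))
```

In a full MDP the analogous quantity is defined over state-action pairs and time steps, and, as the abstract notes, computing it exactly is intractable, which is what motivates a variational approximation.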

