ML · LG · ME · Dec 16, 2024

Generalized Bayesian deep reinforcement learning

arXiv:2412.11743v2 · 2 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the problem of scalable and efficient decision-making in reinforcement learning for researchers and practitioners, though it appears incremental as it builds upon existing Bayesian and deep learning techniques.

The paper tackles the challenge of Bayesian reinforcement learning in uncertain environments by proposing a method that uses deep generative models and a generalized predictive-sequential scoring rule posterior, with simulation studies showing improvements over traditional Thompson sampling in discrete action spaces and extension to continuous action spaces.

Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. As a model-based RL method, it has two key components: (1) inferring the posterior distribution of the model for the data-generating process (DGP) and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models, assuming Markov dependence. In the absence of likelihood functions for these models, we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. To achieve scalability in the high-dimensional parameter space of the neural networks, we use gradient-based Markov kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximising the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions, which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies, assuming a discrete action space. Finally, we extend our setup to a challenging problem with a continuous action space, though without theoretical guarantees.
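To make the contrast between TS and ETS concrete, here is a minimal sketch in Python on a toy tabular MDP. It is an illustration under stated assumptions, not the paper's implementation: the three "posterior samples" are hand-written transition models rather than SMC draws from a scoring rule posterior, and the helper names (`q_values`, `greedy`, `thompson_policy`, `expected_thompson_policy`) are hypothetical. TS acts greedily with respect to a single sampled model, while ETS, as read from the abstract, acts greedily with respect to the action-value function averaged over all samples.

```python
import random


def q_values(P, R, gamma=0.9, iters=200):
    """Value iteration on a small tabular MDP.

    P[s][a] is a list of next-state probabilities and R[s][a] a reward.
    Returns the action-value table Q[s][a].
    """
    n_s, n_a = len(P), len(P[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def greedy(Q):
    """Greedy action per state (ties go to the lowest action index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def thompson_policy(samples, R, rng):
    """TS: plan against ONE model drawn from the (stand-in) posterior."""
    return greedy(q_values(rng.choice(samples), R))


def expected_thompson_policy(samples, R):
    """ETS sketch: plan against the Q-table averaged over ALL samples."""
    Qs = [q_values(P, R) for P in samples]
    n_s, n_a = len(Qs[0]), len(Qs[0][0])
    mean_Q = [[sum(Q[s][a] for Q in Qs) / len(Qs) for a in range(n_a)]
              for s in range(n_s)]
    return greedy(mean_Q)


def model(p):
    """Hypothetical dynamics: in state 0, action 0 reaches the rewarding,
    absorbing state 1 with probability p; action 1 stays in state 0."""
    return [[[1.0 - p, p], [1.0, 0.0]],
            [[0.0, 1.0], [0.0, 1.0]]]


# Rewards: staying in state 0 via action 1 pays 0.5; state 1 pays 1.0.
R = [[0.0, 0.5], [1.0, 1.0]]

# Stand-in posterior samples: two plausible models and one pessimistic outlier.
samples = [model(0.9), model(0.9), model(0.0)]

ets = expected_thompson_policy(samples, R)
ts = thompson_policy(samples, R, random.Random(0))
print("ETS policy:", ets)
print("TS policy (one draw):", ts)
```

In this toy setting, a TS draw that happens to land on the outlier model prefers the safe action in state 0, whereas the averaged Q-table is less sensitive to any single sample. That is only an intuition for why averaging over the posterior can help; the paper's theoretical and simulation results are what support the claim.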

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
