LGFeb 5, 2024

Deep Exploration with PAC-Bayes

arXiv:2402.03055v55 citationsh-index: 17ECAI
AI Analysis

It addresses a critical bottleneck in applying deep exploration to real-world continuous control problems, representing an incremental advance over existing methods.

The paper tackles the problem of deep exploration in reinforcement learning for continuous control with delayed rewards by proposing a PAC-Bayesian actor-critic algorithm, which consistently discovers delayed rewards across tasks of varying difficulty.

Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named {\it PAC-Bayesian Actor-Critic (PBAC)}, is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes