LG | AI | Feb 6, 2022

Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning

arXiv:2202.02693v1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of enhancing exploration and value estimation in distributional RL for continuous control tasks, representing an incremental improvement over existing methods.

The paper aims to improve distributional reinforcement learning for continuous control. It replaces single-sample target value estimation with multi-sample target values and combines them with UCB-based exploration; the authors report state-of-the-art model-free performance on tasks such as Humanoid control.

Distributional reinforcement learning (RL) aims to learn a value-network that predicts the full distribution of the returns for a given state, often modeled via a quantile-based critic. This approach has been successfully integrated into common RL methods for continuous control, giving rise to algorithms such as Distributional Soft Actor-Critic (DSAC). In this paper, we introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation, as commonly employed in current practice. The improved distributional estimates further lend themselves to UCB-based exploration. These two ideas are combined to yield our distributional RL algorithm, E2DC (Extra Exploration with Distributional Critics). We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control. We provide further insight into the method via visualization and analysis of the learned distributions and their evolution during training.
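To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a quantile-based critic trained with the standard quantile Huber loss, reads "multi-sample target values" as pooling the target quantiles obtained from several sampled next actions, and reads "UCB-based exploration" as scoring an action by the mean of its quantiles plus a multiple of their standard deviation. The function names, the pooling scheme, and the `beta` and `kappa` parameters are illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Standard quantile Huber loss between predicted quantiles and target samples.

    pred:   (N,) predicted quantile values at fractions `taus`
    target: (M,) target samples (M may differ from N)
    taus:   (N,) quantile fractions in (0, 1)
    """
    # Pairwise errors u[i, j] = target[j] - pred[i].
    u = target[None, :] - pred[:, None]
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{u < 0}| turns the Huber loss into a quantile loss.
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    # Average over target samples, sum over predicted quantiles.
    return float((weight * huber / kappa).mean(axis=1).sum())


def multi_sample_targets(reward, gamma, next_quantiles):
    """Pool Bellman targets over K sampled next actions (assumed reading of MTV).

    next_quantiles: (K, N) target-critic quantiles, one row per sampled next action.
    Returns a flat (K * N,) array of target samples. With K = 1 this reduces to
    the usual single-sample target.
    """
    return (reward + gamma * np.asarray(next_quantiles)).reshape(-1)


def ucb_score(quantiles, beta=1.0):
    """Optimistic action score: mean of the quantiles plus beta times their spread.

    Hypothetical scoring rule for illustration; beta controls the exploration bonus.
    """
    q = np.asarray(quantiles, dtype=float)
    return float(q.mean() + beta * q.std())
```

In this sketch, a training step would draw several next actions from the policy, evaluate the target critic on each to fill `next_quantiles`, and pass the pooled result as `target` to `quantile_huber_loss`. Pooling is only one way to combine the samples; other aggregation schemes would fit the same description.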
