LG · ML · Jun 21, 2020

Towards Tractable Optimism in Model-Based Reinforcement Learning

arXiv:2006.11911v2 · 28 citations
Originality: Incremental advance
AI Analysis

This work addresses the scalability problem of optimistic RL algorithms for researchers and practitioners in reinforcement learning, offering a tractable method with incremental improvements over existing approaches.

The paper tackles the challenge of scaling optimistic model-based reinforcement learning to deep RL by reinterpreting it as solving a tractable noise-augmented MDP, achieving a competitive regret bound of $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ with Gaussian noise. It also shows empirically that reducing estimation error allows optimistic model-based RL to match state-of-the-art performance in continuous control problems.

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
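To make the idea of a noise-augmented MDP concrete, here is a minimal tabular sketch in Python. It is not the authors' algorithm. It only illustrates one plausible reading of the abstract: plan by finite-horizon value iteration on an empirical model whose rewards are perturbed by Gaussian noise. The function name, the `sigma` parameter, and the `sigma / sqrt(n)` noise scale (larger perturbations for rarely visited state-action pairs) are all assumptions introduced for illustration.

```python
import numpy as np


def noise_augmented_value_iteration(P_hat, R_hat, counts, horizon, sigma=1.0, rng=None):
    """Finite-horizon value iteration on an empirical MDP with noisy rewards.

    P_hat:  (S, A, S) empirical transition probabilities
    R_hat:  (S, A) empirical mean rewards
    counts: (S, A) visit counts for each state-action pair
    sigma:  hypothetical noise multiplier (an assumption, not from the paper)
    Returns Q with shape (horizon, S, A).
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A = R_hat.shape
    # Assumed noise scale: shrinks as a state-action pair is visited more often.
    scale = sigma / np.sqrt(np.maximum(counts, 1))
    Q = np.zeros((horizon, S, A))
    V_next = np.zeros(S)  # value after the final step is zero
    for h in reversed(range(horizon)):
        # Fresh Gaussian perturbation of the rewards at every step h.
        noise = rng.normal(0.0, 1.0, size=(S, A)) * scale
        # Bellman backup on the noise-augmented empirical model.
        Q[h] = R_hat + noise + P_hat @ V_next
        V_next = Q[h].max(axis=1)
    return Q


# Usage on a toy 2-state, 2-action model with uniform transitions.
P_hat = np.full((2, 2, 2), 0.5)
R_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
counts = np.array([[10, 1], [1, 10]])
Q = noise_augmented_value_iteration(
    P_hat, R_hat, counts, horizon=2, rng=np.random.default_rng(0)
)
greedy_first_actions = Q[0].argmax(axis=1)
```

With `sigma=0` the procedure reduces to ordinary value iteration on the empirical model. A larger `sigma` inflates values more where data is scarce, which increases the chance of over-estimation but also increases estimation error, the trade-off the abstract describes.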

