LG RO SYMar 1, 2023

The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms

Anirudh Vemula, Yuda Song, Aarti Singh, J. Andrew Bagnell, Sanjiban Choudhury

CMU

arXiv:2303.00694v117.015 citationsh-index: 65Has Code

Originality Incremental advance

AI Analysis

This addresses efficiency and alignment challenges in MBRL for researchers and practitioners, though it appears incremental as it builds on existing methods with specific improvements.

The paper tackles computational expense and objective mismatch in Model-based Reinforcement Learning by introducing a 'lazy' method with a unified objective, Performance Difference via Advantage in Model, which boosts computational efficiency and aligns model fitting with policy computation, demonstrated through simulated benchmarks.

We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL): the computational expense of repeatedly finding a good policy in the learned model, and the objective mismatch between model fitting and policy computation. Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy under the true dynamics. This objective demonstrates that optimizing the expected policy advantage in the learned model under an exploration distribution is sufficient for policy computation, resulting in a significant boost in computational efficiency compared to traditional planning methods. Additionally, the unified objective uses a value moment matching term for model fitting, which is aligned with the model's usage during policy computation. We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains compared to existing MBRL methods through simulated benchmarks.

View on arXiv PDF Code

Similar