Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
For researchers, the paper exposes fundamental limitations of Dyna-style model-based RL: no single approach works across all benchmarks.
The paper investigates a performance gap in Dyna-style model-based reinforcement learning algorithms: they perform well in OpenAI Gym but degrade significantly in DeepMind Control Suite (DMC), even though the two benchmarks offer similar tasks. Adding synthetic rollouts harms performance across most DMC environments.
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.
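To make the mechanism concrete, the following is a minimal sketch of the Dyna-style data pipeline described above: fit a dynamics model on real transitions, branch short synthetic rollouts from real states, and mix synthetic with real transitions in each training batch. It is an illustration under stated assumptions, not the paper's implementation. The toy 1-D dynamics, the linear least-squares model (standing in for a learned neural model), the omission of rewards, and all function names (`true_step`, `fit_linear_model`, `synthetic_rollouts`, `sample_mixed_batch`) are hypothetical.

```python
import random


def true_step(s, a):
    # Hypothetical toy dynamics standing in for the real environment.
    return 0.9 * s + 0.5 * a


def fit_linear_model(transitions):
    # Least-squares fit of s' = w_s * s + w_a * a via the 2x2 normal equations.
    ss = sum(s * s for s, a, n in transitions)
    sa = sum(s * a for s, a, n in transitions)
    aa = sum(a * a for s, a, n in transitions)
    sy = sum(s * n for s, a, n in transitions)
    ay = sum(a * n for s, a, n in transitions)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a


def synthetic_rollouts(model, start_states, policy, horizon):
    # Branch a short model rollout from each real state (Dyna-style).
    w_s, w_a = model
    out = []
    for s in start_states:
        for _ in range(horizon):
            a = policy(s)
            s_next = w_s * s + w_a * a
            out.append((s, a, s_next))
            s = s_next
    return out


def sample_mixed_batch(real, synthetic, batch_size, synthetic_ratio, rng):
    # Draw a training batch containing a fixed fraction of synthetic data.
    n_syn = int(batch_size * synthetic_ratio)
    batch = rng.choices(synthetic, k=n_syn)
    batch += rng.choices(real, k=batch_size - n_syn)
    return batch


rng = random.Random(0)

# 1. Collect real transitions (s, a, s') with a random policy.
real_buffer = []
for _ in range(200):
    s, a = rng.uniform(-1, 1), rng.uniform(-1, 1)
    real_buffer.append((s, a, true_step(s, a)))

# 2. Fit the dynamics model on real data only.
model = fit_linear_model(real_buffer)

# 3. Generate short synthetic rollouts starting from real states.
starts = [t[0] for t in rng.sample(real_buffer, 50)]
synthetic_buffer = synthetic_rollouts(
    model, starts, lambda s: rng.uniform(-1, 1), horizon=3
)

# 4. Mix real and synthetic transitions for an off-policy update.
batch = sample_mixed_batch(real_buffer, synthetic_buffer, 8, 0.5, rng)
```

In this sketch, setting `synthetic_ratio` to 0 reduces the batch to real data only, which corresponds to the model-free baseline that the synthetic-rollout variant is compared against.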