AI · May 22, 2025

Enter the Void - Planning to Seek Entropy When Reward is Scarce

Ashish Sundar, Chunbo Luo, Xiaoyang Wang

arXiv:2505.16787v23.3

Originality: Incremental advance

AI Analysis

This work addresses sample-efficiency and planning bottlenecks in MBRL for robotics and simulation domains, but it is incremental because it builds on existing methods such as Dreamer.

The paper aims to improve world model fidelity and reduce convergence time in model-based reinforcement learning. It proposes a hierarchical planner that uses latent predictions to actively seek high-entropy states. The reported results are that the method finishes Miniworld mazes 50% faster than base Dreamer and that its policy converges in 60% of the environment steps base Dreamer needs.

Model-based reinforcement learning (MBRL) offers an intuitive way to increase the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. MBRL methods have progressed largely by prioritising the actor; optimising world model learning has meanwhile been neglected. Improving the fidelity of the world model and reducing its time to convergence can yield significant downstream benefits, one of which is improving the ensuing performance of any actor it may train. We propose a novel approach that anticipates and actively seeks out high-entropy states using short-horizon latent predictions generated by the world model, offering a principled alternative to traditional curiosity-driven methods that chase once-novel states well after they were stumbled into. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multi-step plans after every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, how long the planning horizon should be, and how to weight reward against entropy. While our method can theoretically be applied to any model that trains its own actors solely with model-generated data, we have applied it only to Dreamer as a proof of concept. Our method finishes the Miniworld procedurally generated mazes 50% faster than base Dreamer at convergence, and the policy trained in imagination converges in only 60% of the environment steps that base Dreamer needs.
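To make the idea of trading reward against predicted entropy concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names, the linear weighting in `score_plan`, the fixed `threshold` replan rule, and the hand-written `candidates` are all assumptions made for illustration. In the paper, the hierarchical planner decides the weighting, horizon, and replanning dynamically, and the predictions come from the world model's latent rollouts rather than hard-coded distributions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def score_plan(rewards, state_dists, entropy_weight):
    """Score one imagined rollout as a weighted sum of predicted reward
    and predicted state entropy (assumed linear trade-off)."""
    total_reward = sum(rewards)
    total_entropy = sum(entropy(d) for d in state_dists)
    return (1.0 - entropy_weight) * total_reward + entropy_weight * total_entropy

def choose_plan(candidates, entropy_weight):
    """Pick the candidate (actions, rewards, state_dists) with the best score."""
    return max(candidates, key=lambda c: score_plan(c[1], c[2], entropy_weight))

def should_replan(predicted_dist, observed_index, threshold=0.1):
    """Hypothetical commitment rule: keep following the plan unless the
    observed latent state was given low probability by the prediction."""
    return predicted_dist[observed_index] < threshold

# Two hand-written short-horizon rollouts standing in for world model output.
candidates = [
    # Rewarding but predictable path.
    (["forward", "forward"], [1.0, 0.0], [[1.0, 0.0], [0.9, 0.1]]),
    # Unrewarding path where the model is uncertain about what comes next.
    (["left", "forward"], [0.0, 0.0], [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]),
]

reward_only = choose_plan(candidates, entropy_weight=0.0)[0]
entropy_only = choose_plan(candidates, entropy_weight=1.0)[0]
```

With `entropy_weight=0.0` the sketch picks the rewarding rollout, and with `entropy_weight=1.0` it picks the rollout the model is least certain about, which is the behaviour one would want when reward is scarce. The `should_replan` rule illustrates commitment: the agent keeps executing a multi-step plan and only replans when the observation contradicts the prediction.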
