SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments
The benchmark provides a controlled testbed for researchers to systematically evaluate world models, though the contribution is incremental because it focuses on benchmarking rather than on developing new models.
The authors address the lack of a unified evaluation setting for world models by introducing the SmallWorld Benchmark, which assesses how well models capture environment dynamics. Their experiments reveal how effectively models capture environment structure and how their predictions deteriorate over extended rollouts.
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures, including the Recurrent State Space Model, Transformer, diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into directions for future improvement in representation learning and dynamics modeling.
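To make "predictions deteriorate over extended rollouts" concrete, the sketch below contrasts two ways of scoring a learned dynamics model in a fully observable state space: teacher-forced one-step prediction, where the model always starts from the true state, and open-loop rollout, where the model consumes its own predictions. This is a minimal illustration under stated assumptions, not the SmallWorld Benchmark's actual protocol: the abstract does not specify its environments or metrics, so the toy rotation environment, the per-step mean squared error, and the function names `rotation`, `rollout_errors`, and `one_step_errors` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence

State = List[float]
Dynamics = Callable[[State], State]


def rotation(theta: float) -> Dynamics:
    """Hypothetical toy environment: rotate a 2-D state by theta radians per step."""
    c, s = math.cos(theta), math.sin(theta)

    def step(x: State) -> State:
        return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

    return step


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two states (assumed metric, not from the paper)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def rollout_errors(true_step: Dynamics, model_step: Dynamics,
                   x0: State, horizon: int) -> List[float]:
    """Open-loop rollout: the model feeds on its own predictions, so errors compound."""
    x_true, x_pred = list(x0), list(x0)
    errors = []
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_pred = model_step(x_pred)
        errors.append(mse(x_true, x_pred))
    return errors


def one_step_errors(true_step: Dynamics, model_step: Dynamics,
                    x0: State, horizon: int) -> List[float]:
    """Teacher-forced: each prediction starts from the true state, so errors do not compound."""
    x_true = list(x0)
    errors = []
    for _ in range(horizon):
        x_next = true_step(x_true)
        errors.append(mse(x_next, model_step(x_true)))
        x_true = x_next
    return errors


if __name__ == "__main__":
    # A "learned" model whose rotation angle is slightly off (0.11 vs 0.1 rad).
    true_step, model_step = rotation(0.1), rotation(0.11)
    x0 = [1.0, 0.0]
    open_loop = rollout_errors(true_step, model_step, x0, 50)
    teacher = one_step_errors(true_step, model_step, x0, 50)
    for k in (1, 10, 50):
        print(f"step {k:2d}: one-step {teacher[k - 1]:.2e}  open-loop {open_loop[k - 1]:.2e}")
```

In this toy case the one-step error stays constant at 1 - cos(0.01) per step, while the open-loop error after k steps equals 1 - cos(0.01k) and keeps growing, so a model that looks accurate under single-step evaluation can still drift far from the true trajectory. Reporting error as a function of rollout horizon, rather than a single averaged number, is what exposes this kind of degradation.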