RO · AI · May 31, 2025

WorldGym: World Model as An Environment for Policy Evaluation

arXiv:2506.00613v3 · 21 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the costly, manual effort of real-world testing and simulator development in robotics, though it is incremental because it builds on existing world-model and vision-language-model techniques.

The paper tackles the problem of evaluating robot control policies by proposing WorldGym, a world-model-based environment that uses an autoregressive, action-conditioned video generation model as a proxy for real-world testing. It shows that policy success rates in the world model correlate highly with real-world success rates, and that relative policy rankings are preserved across different policy versions, sizes, and training checkpoints.

Evaluating robot control policies is difficult: real-world testing is costly, and handcrafted simulators require manual effort to improve in realism and generality. We propose a world-model-based policy evaluation environment (WorldGym), an autoregressive, action-conditioned video generation model which serves as a proxy for real-world environments. Policies are evaluated via Monte Carlo rollouts in the world model, with a vision-language model providing rewards. We evaluate a set of VLA-based real-robot policies in the world model using only initial frames from real robots, and show that policy success rates within the world model highly correlate with real-world success rates. Moreover, we show that WorldGym is able to preserve relative policy rankings across different policy versions, sizes, and training checkpoints. Because it requires only a single start frame as input, the world model further enables efficient evaluation of robot policies' generalization ability on novel tasks and environments. We find that modern VLA-based robot policies still struggle to distinguish object shapes and can become distracted by adversarial facades of objects. While generating highly realistic object interaction remains challenging, WorldGym faithfully emulates robot motions and offers a practical starting point for safe and reproducible policy evaluation before deployment.
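
The evaluation loop described in the abstract (Monte Carlo rollouts from real initial frames, scored by a reward model) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in, not WorldGym's code or API: the "frame" is a single number rather than an image, `toy_world_model` replaces the action-conditioned video model, `toy_reward` replaces the vision-language model judge, and the two policies are simple proportional controllers. Only the structure of the loop is meant to carry over.

```python
import random
from typing import Callable, List

# Assumption: a "frame" is a toy 1-D state (e.g. a gripper position),
# standing in for the images a real video world model would produce.
Frame = List[float]
Policy = Callable[[Frame], float]

GOAL = 1.0
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def toy_world_model(frame: Frame, action: float) -> Frame:
    """Hypothetical stand-in for an action-conditioned video model:
    predicts the next frame from the current frame and the action."""
    return [frame[0] + action + rng.gauss(0.0, 0.05)]


def toy_reward(frames: List[Frame]) -> int:
    """Hypothetical stand-in for a VLM judge: returns 1 if the final
    frame shows the task completed, else 0."""
    return 1 if abs(frames[-1][0] - GOAL) < 0.2 else 0


def rollout(policy: Policy, start_frame: Frame, horizon: int) -> int:
    """Autoregressive rollout: each predicted frame is fed back to the policy."""
    frames = [start_frame]
    for _ in range(horizon):
        action = policy(frames[-1])
        frames.append(toy_world_model(frames[-1], action))
    return toy_reward(frames)


def estimate_success_rate(policy: Policy, start_frames: List[Frame],
                          horizon: int = 10, rollouts_per_frame: int = 20) -> float:
    """Monte Carlo estimate: fraction of successful rollouts over all start frames."""
    outcomes = [rollout(policy, frame, horizon)
                for frame in start_frames
                for _ in range(rollouts_per_frame)]
    return sum(outcomes) / len(outcomes)


def make_policy(gain: float) -> Policy:
    """Toy proportional controller; a larger gain reaches the goal faster."""
    return lambda frame: gain * (GOAL - frame[0])


start_frames = [[0.0], [0.1]]  # stand-ins for initial frames from real robots
good_rate = estimate_success_rate(make_policy(0.5), start_frames)
weak_rate = estimate_success_rate(make_policy(0.05), start_frames)
print(f"good policy: {good_rate:.2f}, weak policy: {weak_rate:.2f}")
```

Comparing `good_rate` and `weak_rate` mirrors how success rates estimated inside a world model could be used to rank policies; whether such a ranking matches the real world depends entirely on the fidelity of the world model and the reward model, which this toy does not capture.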

