RO CV Mar 26

Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

arXiv:2603.25685 78.8 h-index: 5

AI Analysis

This work addresses the challenge of stabilizing long-horizon predictions in robot world models, which is crucial for reliable simulation in robotics. The contribution is incremental in that it builds on existing methods for diffusion models.

The paper tackles the degradation in visual quality that action-conditioned robot world models suffer during multi-step autoregressive rollouts. It introduces a reinforcement learning post-training scheme that trains the model on its own rollouts and reports state-of-the-art rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras and SSIM improved by 9.1% on the wrist camera).

Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
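The second and third contributions (comparing several candidate futures from the same rollout state, and scoring each with a clip-level, multi-view fidelity reward) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal metric weights, the use of negative MSE and a single-window SSIM as stand-ins for the paper's perceptual metrics (LPIPS needs a learned network and is omitted), and the mean-centered advantage are all assumptions made for illustration. Frames are represented as flat lists of pixel intensities in [0, 1].

```python
from statistics import fmean
from typing import Dict, List

Frame = List[float]               # flattened pixel intensities in [0, 1]
Clip = List[Frame]                # frames from one camera view
MultiViewClip = Dict[str, Clip]   # camera name -> clip


def neg_mse(pred: Frame, target: Frame) -> float:
    """Pixel-level fidelity: negative mean squared error (higher is better)."""
    return -fmean((p - t) ** 2 for p, t in zip(pred, target))


def global_ssim(pred: Frame, target: Frame,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """SSIM computed over the whole frame as one window (a simplification)."""
    mu_p, mu_t = fmean(pred), fmean(target)
    var_p = fmean((p - mu_p) ** 2 for p in pred)
    var_t = fmean((t - mu_t) ** 2 for t in target)
    cov = fmean((p - mu_p) * (t - mu_t) for p, t in zip(pred, target))
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return num / den


def clip_reward(pred: MultiViewClip, target: MultiViewClip,
                w_mse: float = 1.0, w_ssim: float = 1.0) -> float:
    """Average the combined metric over all frames and views of one clip.

    Taking the mean (not the sum) keeps rewards comparable across
    candidates of different lengths. Weights are hypothetical.
    """
    scores = []
    for view, target_clip in target.items():
        # zip stops at the shorter clip, so only overlapping frames count
        for p, t in zip(pred[view], target_clip):
            scores.append(w_mse * neg_mse(p, t) + w_ssim * global_ssim(p, t))
    return fmean(scores)


def group_advantages(rewards: List[float]) -> List[float]:
    """Center rewards within a group of candidates from the same state."""
    mean_r = fmean(rewards)
    return [r - mean_r for r in rewards]


# Toy usage: two candidate futures for a two-camera, one-frame clip.
target = {"wrist": [[0.1, 0.5, 0.9, 0.3]], "external": [[0.2, 0.4, 0.6, 0.8]]}
candidates = [
    {"wrist": [[0.1, 0.5, 0.9, 0.3]], "external": [[0.2, 0.4, 0.6, 0.8]]},
    {"wrist": [[0.9, 0.1, 0.2, 0.7]], "external": [[0.8, 0.6, 0.4, 0.2]]},
]
rewards = [clip_reward(c, target) for c in candidates]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

In this sketch, a candidate with a positive advantage would be reinforced and one with a negative advantage discouraged. How the paper's contrastive objective actually consumes such comparisons is not specified in the abstract.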
