CVMay 28, 2025

StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool

arXiv:2505.22246v322.315 citationsh-index: 30

Originality Incremental advance

AI Analysis

This addresses the limitation of temporal coherence in diffusion-based world models for action-conditioned visual prediction, though it is incremental as it builds on existing methods.

The paper tackles the problem of world models losing long-term context in visual prediction, causing scene drift, and introduces StateSpaceDiffuser, which integrates state-space features into diffusion models to restore memory, resulting in coherent visual context for an order of magnitude more steps in 2D and 3D environments.

World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze navigation and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective in demonstrating both visual details and long-term memory. Project page: https://insait-institute.github.io/StateSpaceDiffuser/.

View on arXiv PDF

Similar