CV · Jun 2, 2025

LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model

Peking U

arXiv:2506.01546v114.46 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the challenge of generating consistent long-term driving videos, which is crucial for practical applications such as autonomous driving simulation, though the contribution appears incremental because it builds on existing diffusion transformer approaches.

The paper tackles the problem of error accumulation in long-term video generation for driving world models by proposing a hierarchical decoupling and cross-granularity distillation method, resulting in a 27% improvement in FVD and an 85% reduction in inference time for generating 110+ frames on the NuScenes benchmark.
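As a reading aid for the two headline numbers, here is a short worked calculation. It assumes both percentages are relative changes against the baseline model; the symbols $\text{FVD}_{\text{base}}$, $\text{FVD}_{\text{ours}}$, $t_{\text{base}}$ and $t_{\text{ours}}$ are introduced here for illustration only. FVD is a distance, so lower is better.

$$
\text{FVD}_{\text{ours}} \approx (1 - 0.27)\,\text{FVD}_{\text{base}} = 0.73\,\text{FVD}_{\text{base}},
\qquad
t_{\text{ours}} \approx (1 - 0.85)\,t_{\text{base}} = 0.15\,t_{\text{base}}
\;\Rightarrow\;
\frac{t_{\text{base}}}{t_{\text{ours}}} \approx 6.7
$$

Under that assumption, an 85% reduction in inference time corresponds to roughly a 6.7x speedup.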

Driving world models simulate possible futures by generating video conditioned on the current state and actions. However, current models often suffer from severe error accumulation when predicting the long-term future, which limits their practical application. Recent studies use the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are typically trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos because of the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method in which fine-grained video flows serve as self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term, temporally coherent videos. On the public NuScenes benchmark, compared with the state-of-the-art front-view model, our model improves FVD by $27\%$ and reduces inference time by $85\%$ for the task of generating 110+ frames. More videos (including 90s duration) are available at https://Wang-Xiaodong1899.github.io/longdwm/.
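The coarse-then-fine coordination described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: frames are stand-in floats, `step` is a hypothetical placeholder for the coarse-grained (large motion) module, and linear interpolation is a placeholder for the fine-grained bidirectional module. All function names are invented for illustration, and the distillation step is not modeled.

```python
from typing import Callable, List

Frame = float  # stand-in for an image tensor (assumption for illustration)


def coarse_rollout(first: Frame, num_keyframes: int,
                   step: Callable[[Frame], Frame]) -> List[Frame]:
    """Roll out sparse keyframes autoregressively (large-motion stage)."""
    keyframes = [first]
    for _ in range(num_keyframes - 1):
        keyframes.append(step(keyframes[-1]))
    return keyframes


def fine_infill(start: Frame, end: Frame, num_between: int) -> List[Frame]:
    """Fill frames between two keyframes, conditioned on BOTH endpoints.

    Linear interpolation is only a placeholder for a learned
    bidirectional continuous-motion model.
    """
    return [start + (end - start) * (i / (num_between + 1))
            for i in range(1, num_between + 1)]


def generate_long_video(first: Frame, num_keyframes: int, num_between: int,
                        step: Callable[[Frame], Frame]) -> List[Frame]:
    """Coordinate the two stages: keyframes first, then dense in-fill."""
    keyframes = coarse_rollout(first, num_keyframes, step)
    video = [keyframes[0]]
    for a, b in zip(keyframes, keyframes[1:]):
        video.extend(fine_infill(a, b, num_between))
        video.append(b)
    return video


# 12 keyframes with 9 in-between frames each: 1 + 11 * 10 = 111 frames.
video = generate_long_video(0.0, num_keyframes=12, num_between=9,
                            step=lambda f: f + 10.0)
print(len(video))  # 111
```

In this toy setup only the 12 keyframes are produced by sequential roll-out, and each in-between segment is anchored at both ends. That gives an intuition for why a hierarchical split could limit error accumulation compared with rolling out every frame in sequence.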

