CV | Dec 17, 2025

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

arXiv:2512.15702v1 | 20 citations | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses a key challenge in video generation for applications such as world simulation, though it is an incremental improvement over existing post-training approaches.

The paper tackles exposure bias in autoregressive video diffusion models by introducing Resampling Forcing, an end-to-end training framework. A self-resampling scheme simulates inference-time errors on history frames during training, while a sparse causal mask enforces temporal causality. The method achieves performance comparable to distillation-based methods, with improved temporal consistency on longer videos.
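To make the idea of frame-level temporal causality concrete, here is a minimal sketch of a dense block-causal attention mask in NumPy. It is an illustration only: the paper's sparse causal mask is not specified on this page, and the function name `frame_causal_mask` and its parameters are hypothetical.

```python
import numpy as np


def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Return a boolean mask where entry [q, k] is True if query token q
    may attend to key token k.

    Assumption: tokens are laid out frame by frame, and a token may see
    every token in its own frame and in all earlier frames. This is a
    plain dense block-causal mask, not the paper's sparse variant.
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2].
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A query in frame i may attend to keys in any frame j <= i.
    return frame_ids[:, None] >= frame_ids[None, :]


mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Because the mask is built once for the whole sequence, all frames can be processed in a single forward pass while each frame still only sees its past, which is what allows parallel training under a causal constraint.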

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
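The history routing step described above, retrieving the top-k most relevant history frames for a query without learned parameters, might be sketched as follows. The scoring rule used here (dot product between the query and each frame's mean-pooled keys) is an assumption for illustration, as are the name `route_history` and the tensor shapes; the page does not state how the paper computes relevance.

```python
import numpy as np


def route_history(query: np.ndarray, history_keys: np.ndarray, k: int) -> np.ndarray:
    """Pick the k history frames most relevant to a query vector.

    query:        shape (d,)
    history_keys: shape (num_frames, tokens_per_frame, d)
    Returns the selected frame indices in temporal order.

    Assumption: relevance is the dot product between the query and the
    mean-pooled keys of each frame, so no extra parameters are needed.
    """
    # One descriptor per history frame, shape (num_frames, d).
    frame_desc = history_keys.mean(axis=1)
    # Relevance score per frame, shape (num_frames,).
    scores = frame_desc @ query
    k = min(k, scores.shape[0])
    # Indices of the k highest-scoring frames.
    top = np.argsort(-scores)[:k]
    # Restore temporal order so downstream attention sees frames in sequence.
    return np.sort(top)


rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4, 16))   # 8 history frames, 4 tokens each, dim 16
q = rng.normal(size=16)
print(route_history(q, keys, k=3))
```

Attention is then computed only over the selected frames, so per-query cost depends on k rather than on the full history length.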

