CVMar 15

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

arXiv:2602.0777584.210 citationsh-index: 11
AI Analysis

This addresses a critical problem for video synthesis applications by bridging the gap between limited-horizon training and open-ended testing, though it is incremental as it builds on prior work like Self Forcing.

The paper tackles the train-test gap in autoregressive video diffusion models, where limited training durations cause visual degradation during open-ended testing; Rolling Sink, a training-free method, enables scaling to ultra-long durations (e.g., 5-30 minutes) with superior visual fidelity and temporal consistency compared to state-of-the-art baselines.

Recently, autoregressive (AR) video diffusion models has achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes