Autoregression-free video prediction using diffusion model for mitigating error propagation
The work addresses a key limitation of video prediction in applications such as robotics and autonomous driving, though the contribution is incremental because it builds on existing diffusion model techniques.
The paper tackles error propagation in long-term video prediction by proposing an autoregression-free framework using diffusion models, which directly predicts future frames from context frames and outperforms state-of-the-art methods on two benchmark datasets.
Existing long-term video prediction methods often rely on an autoregressive prediction mechanism. However, this approach suffers from error propagation, particularly in distant future frames. To address this limitation, this paper proposes the first AutoRegression-Free (ARFree) video prediction framework using diffusion models. Unlike an autoregressive mechanism, ARFree directly predicts any future frame tuple from the context frame tuple. The proposed ARFree consists of two key components: 1) a motion prediction module that predicts future motion using motion features extracted from the context frame tuple; 2) a training method that improves motion continuity and contextual consistency between adjacent future frame tuples. Our experiments on two benchmark datasets show that the proposed ARFree video prediction framework outperforms several state-of-the-art video prediction methods.
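The error-propagation argument can be illustrated with a toy simulation. The sketch below is not the ARFree model and uses no diffusion: it replaces frames with a single scalar position moving at constant velocity, and it assumes every predictor call adds independent Gaussian noise of the same size. The names `autoregressive_rollout`, `direct_prediction`, and `mean_abs_error_per_step` are hypothetical. In particular, the assumption that a direct predictor is equally accurate at every horizon is a simplification made only to isolate the effect of feeding predictions back in.

```python
import random


def autoregressive_rollout(x0, velocity, horizon, sigma, rng):
    """Predict each step from the previous *prediction*, so noise feeds forward."""
    preds, current = [], x0
    for _ in range(horizon):
        # The noisy output of this step becomes the input of the next step.
        current = current + velocity + rng.gauss(0.0, sigma)
        preds.append(current)
    return preds


def direct_prediction(x0, velocity, horizon, sigma, rng):
    """Predict every step from the context x0 and the target index k only."""
    # Toy assumption: per-call noise does not depend on the horizon k.
    return [x0 + k * velocity + rng.gauss(0.0, sigma)
            for k in range(1, horizon + 1)]


def mean_abs_error_per_step(predict, trials=2000, horizon=20, sigma=0.1, seed=0):
    """Average absolute error at each future step over many noisy trials."""
    rng = random.Random(seed)
    x0, velocity = 0.0, 1.0
    truth = [x0 + k * velocity for k in range(1, horizon + 1)]
    totals = [0.0] * horizon
    for _ in range(trials):
        preds = predict(x0, velocity, horizon, sigma, rng)
        for k in range(horizon):
            totals[k] += abs(preds[k] - truth[k])
    return [total / trials for total in totals]


ar_err = mean_abs_error_per_step(autoregressive_rollout)
direct_err = mean_abs_error_per_step(direct_prediction)
```

Under these assumptions, the autoregressive error grows with the prediction horizon because each step inherits the accumulated noise of all earlier steps, while the direct predictor's error stays roughly flat. Comparing `ar_err[-1]` with `direct_err[-1]` shows the gap at the most distant step.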