Motion and Context-Aware Audio-Visual Conditioned Video Prediction
This work addresses the problem of audio-visual conditioned video prediction for applications like video generation and robotics, representing an incremental improvement over prior methods.
The paper tackles the challenge of predicting future video frames conditioned on audio-visual inputs by decoupling motion and appearance modeling. It uses multimodal motion estimation and context-aware refinement to improve long-term predictions, and achieves competitive results on existing benchmarks.
The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, directly inferring per-pixel intensities for the next visual frame is extremely challenging because of the high-dimensional image space. To address this, we decouple audio-visual conditioned video prediction into motion and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from a motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the loss of global appearance context under long-term continuous warping. The global appearance context is extracted by the context encoder and modulated by a motion-conditioned affine transformation before being fused with the features of the warped frames. Experiments show that our method achieves competitive results on existing benchmarks.
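The two operations named in the abstract, warping a frame with predicted optical flow and modulating a global context vector with a motion-conditioned affine transformation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`warp_frame`, `motion_conditioned_affine`), the backward-warping convention, the linear projections `w_gamma` and `w_beta`, and the fusion by simple addition are all assumptions made for illustration; the actual model presumably uses learned networks operating on feature maps.

```python
import numpy as np


def warp_frame(frame, flow):
    """Backward-warp `frame` (H, W, C) with `flow` (H, W, 2), bilinear sampling.

    Assumed convention: flow[y, x] = (dx, dy) means output pixel (y, x) is
    sampled from frame at (y + dy, x + dx). Coordinates are clamped to the
    image border.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling location.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]

    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom


def motion_conditioned_affine(context, motion, w_gamma, w_beta):
    """Scale and shift a context vector (C,) using a motion vector (M,).

    `w_gamma` and `w_beta` are hypothetical (M, C) projections standing in
    for whatever learned layers produce the affine parameters.
    """
    gamma = 1.0 + motion @ w_gamma  # scale, identity when the projection is zero
    beta = motion @ w_beta          # shift
    return gamma * context + beta


rng = np.random.default_rng(0)
frame = rng.random((4, 5, 3))

# A uniform flow that samples every pixel from its right-hand neighbour.
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
warped = warp_frame(frame, flow)

# Global context modulated by motion; zero projections leave it unchanged.
context = rng.random(3)
motion = rng.random(2)
w_gamma = np.zeros((2, 3))
w_beta = np.zeros((2, 3))
modulated = motion_conditioned_affine(context, motion, w_gamma, w_beta)

# Assumed fusion: broadcast-add the modulated context to the warped frame.
fused = warped + modulated
```

Because each predicted frame is produced by warping the previous one, resampling errors accumulate over long horizons, which is the motivation the abstract gives for re-injecting a global appearance context at the refinement stage.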