CV · AI · MM · Apr 6, 2021

Strumming to the Beat: Audio-Conditioned Contrastive Video Textures

arXiv:2104.02687v1 · 21 citations
Originality: Incremental advance
AI Analysis

This work targets video synthesis for entertainment and media applications, replacing the hand-designed distance metrics of the classic Video Textures method with a metric learned through self-supervised contrastive learning. The advance is incremental: it combines two existing ideas, contrastive learning and video textures.

The paper tackles the problem of generating infinite video textures from a single video by learning frame representations and transition probabilities via contrastive learning, enabling audio-conditioned synthesis without fine-tuning and achieving better human perceptual scores than baselines.

We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities to generate diverse temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues in order to synthesize videos that synchronize well with an audio signal.
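The sampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the learned model has already produced one query embedding and one target embedding per frame (the names `queries` and `targets` are hypothetical), converts dot-product similarities into transition probabilities with a temperature-scaled softmax, and restricts sampling to the `top_k` most probable next frames. The temperature value, the `top_k` heuristic, and the toy embeddings are all assumptions made for illustration.

```python
import math
import random


def transition_probs(query, targets, temperature=0.1):
    """Softmax over scaled dot products between one query and every target."""
    logits = [sum(q * t for q, t in zip(query, target)) / temperature
              for target in targets]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_texture(queries, targets, start, length, top_k=2, seed=0):
    """Sample a sequence of frame indices, one transition at a time."""
    rng = random.Random(seed)
    order = [start]
    while len(order) < length:
        probs = transition_probs(queries[order[-1]], targets)
        # Keep only the top_k most probable next frames (an assumed heuristic),
        # then sample among them in proportion to their probabilities.
        candidates = sorted(range(len(probs)), key=probs.__getitem__,
                            reverse=True)[:top_k]
        weights = [probs[j] for j in candidates]
        order.append(rng.choices(candidates, weights=weights, k=1)[0])
    return order


# Toy data (hypothetical): 4 frames on a loop. Query i is built to match
# target i + 1, so the most likely transition is i -> i + 1, wrapping around.
targets = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
queries = [targets[(i + 1) % 4] for i in range(4)]
order = sample_texture(queries, targets, start=0, length=8)
```

Here `order` is a list of frame indices that could be used to stitch the output video. In a real setting the embeddings would come from the video-specific model trained with contrastive learning, and an audio-conditioned variant would need an additional term in the transition score, which this sketch does not attempt to model.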
