CV | Feb 24

PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

arXiv:2602.20583v1 | 1 citation | h-index: 5
Originality: Highly original
AI Analysis

PropFly addresses the challenge of acquiring large-scale paired video datasets for video editing, offering a more efficient training approach for researchers and practitioners in video generation and editing.

The paper tackles the problem of training propagation-based video editing models without costly paired datasets. It proposes PropFly, a pipeline that uses on-the-fly supervision from pre-trained video diffusion models, and the authors report that it significantly outperforms state-of-the-art methods on various video editing tasks.

Propagation-based video editing enables precise user control by propagating a single edited frame into the following frames while maintaining the original context, such as motion and structure. However, training such models requires large-scale paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on-the-fly. The source latent provides the structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
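
The pair-synthesis step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a rectified-flow parameterization (x_t = (1 - t) * x0 + t * noise, velocity v = noise - x0), the standard CFG combination, and a hypothetical `toy_velocity` function standing in for the pre-trained VDM. The guidance scales 1.0 and 5.0 are arbitrary illustrative values, and the GMFM loss itself is not shown because this summary does not specify it.

```python
import random

def cfg_combine(v_uncond, v_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor `scale`."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]

def one_step_clean_estimate(x_t, v, t):
    """Assumed parameterization: x_t = (1 - t) * x0 + t * noise and
    v = noise - x0, which gives x0 = x_t - t * v."""
    return [x - t * vi for x, vi in zip(x_t, v)]

def toy_velocity(x_t, t, target):
    """Hypothetical stand-in for a VDM: the exact velocity when the
    data distribution is the single point `target`."""
    return [(x - m) / t for x, m in zip(x_t, target)]

def synthesize_pair(x_t, t, uncond_target, cond_target, low_scale, high_scale):
    """Build a ('source', 'edited') latent pair from one noised latent
    by decoding it at a low and a high guidance scale."""
    v_u = toy_velocity(x_t, t, uncond_target)
    v_c = toy_velocity(x_t, t, cond_target)
    source = one_step_clean_estimate(x_t, cfg_combine(v_u, v_c, low_scale), t)
    edited = one_step_clean_estimate(x_t, cfg_combine(v_u, v_c, high_scale), t)
    return source, edited

# Toy 4-dimensional "latents"; all values are illustrative only.
rng = random.Random(0)
uncond_target = [0.0, 0.0, 0.0, 0.0]
cond_target = [1.0, -1.0, 0.5, 2.0]
t = 0.6
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
x_t = [(1 - t) * c + t * n for c, n in zip(cond_target, noise)]

source, edited = synthesize_pair(x_t, t, uncond_target, cond_target,
                                 low_scale=1.0, high_scale=5.0)
```

Both latents are decoded from the same noised latent `x_t`, so they share its content and differ only in guidance strength. In this toy setting the high-scale estimate is simply an extrapolation of the low-scale one away from the unconditional prediction, which is the kind of source-to-edited contrast the abstract describes an adapter learning to propagate.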

