CV | AI | MM | Mar 21, 2024

AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

arXiv:2403.14468v4 | 124 citations | h-index: 16 | Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses the need for more flexible, higher-quality video editing tools for digital content creators. It is incremental in that it builds on existing image editing and generation models.

The paper tackles the low quality and limited control of video editing with generative models by introducing AnyV2V, a tuning-free framework that reduces editing to two steps: modifying the first frame with an image editing model, then generating the video via temporal feature injection. It achieves CLIP-scores comparable to the baselines and significantly outperforms them in human evaluations of visual consistency and quality.

In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.
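The two-step structure described in the abstract can be sketched as a small pipeline. This is only an illustration of the control flow, not the authors' implementation: the class `AnyV2VSketch`, the callables `edit_image` and `image_to_video`, and the toy stand-ins below are all hypothetical names introduced here. In particular, the toy `image_to_video` copies per-frame offsets from the source video; it does not implement temporal feature injection, which in the real method happens inside an image-to-video generation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy stand-in for an image: a flat list of pixel values.
# A real pipeline would use arrays or tensors instead.
Frame = List[float]


@dataclass
class AnyV2VSketch:
    """Hypothetical wrapper showing the two-step control flow only."""

    # Placeholder for any off-the-shelf image editing model (assumption).
    edit_image: Callable[[Frame, str], Frame]
    # Placeholder for an image-to-video model that is guided by the
    # source video (assumption; the guidance mechanism is abstracted away).
    image_to_video: Callable[[Frame, Sequence[Frame]], List[Frame]]

    def edit(self, source_video: Sequence[Frame], instruction: str) -> List[Frame]:
        if not source_video:
            raise ValueError("source_video must contain at least one frame")
        # Step 1: edit only the first frame.
        edited_first = self.edit_image(source_video[0], instruction)
        # Step 2: generate the remaining frames from the edited first frame,
        # using the source video as motion and structure guidance.
        edited_video = self.image_to_video(edited_first, source_video)
        if len(edited_video) != len(source_video):
            raise ValueError("generated video must match the source length")
        return edited_video


def toy_editor(frame: Frame, instruction: str) -> Frame:
    # Toy "edit": brighten every pixel by 1.0 and ignore the instruction.
    return [pixel + 1.0 for pixel in frame]


def toy_image_to_video(first_frame: Frame, source_video: Sequence[Frame]) -> List[Frame]:
    # Toy "motion transfer": add each source frame's offset from the
    # first source frame onto the edited first frame.
    base = source_video[0]
    return [
        [e + (s - b) for e, s, b in zip(first_frame, frame, base)]
        for frame in source_video
    ]


pipeline = AnyV2VSketch(edit_image=toy_editor, image_to_video=toy_image_to_video)
source = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
edited = pipeline.edit(source, "make it brighter")
```

Keeping the image editor as a swappable callable mirrors the abstract's claim that any existing image editing tool can drive the edit, since only the first frame passes through it.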

Code Implementations: 1 repo
