CV · Mar 10, 2024

FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

arXiv:2403.06269v2 · 8 citations · h-index: 3 · WACV
Originality: Incremental advance
AI Analysis

The paper addresses the slow editing speed that makes text-to-video editing impractical for real-time applications. Its contribution is incremental, since it builds on existing diffusion and consistency model frameworks.

The paper tackles the computational inefficiency of text-to-video editing by proposing FastVideoEdit, which uses Consistency Models to eliminate inversion steps, achieving state-of-the-art performance with faster editing speeds while maintaining quality.

Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment.
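The self-consistency property mentioned in the abstract can be illustrated on a toy problem. The sketch below is not FastVideoEdit's implementation. It assumes a data distribution concentrated at a single point `mu` and a probability-flow ODE of the form dx/dt = (x - mu) / t, for which the consistency function has a closed form. The names `consistency_fn`, `point_on_trajectory`, and the value of `EPS` are hypothetical choices for illustration. It shows only that every point on one ODE trajectory maps to the same output, which is the property that lets a consistency model skip step-by-step inversion.

```python
import numpy as np

EPS = 0.002  # smallest time on the trajectory (assumed value for this toy)

def consistency_fn(x_t, t, mu):
    """Toy consistency function for data concentrated at the point mu.

    Under dx/dt = (x - mu) / t, trajectories are straight lines through mu,
    so the trajectory's point at time EPS is available in closed form.
    """
    return mu + (EPS / t) * (x_t - mu)

def point_on_trajectory(x_eps, t, mu):
    """Follow the same ODE trajectory from time EPS to time t."""
    return mu + (t / EPS) * (x_eps - mu)

mu = np.array([1.0, -2.0])
x_eps = mu + EPS * np.array([0.3, 0.7])  # trajectory origin at t = EPS

# Two different noise levels on the same trajectory.
x_a = point_on_trajectory(x_eps, 0.5, mu)
x_b = point_on_trajectory(x_eps, 40.0, mu)

# Self-consistency: both points map back to the same origin.
out_a = consistency_fn(x_a, 0.5, mu)
out_b = consistency_fn(x_b, 40.0, mu)
print(np.allclose(out_a, out_b), np.allclose(out_a, x_eps))
```

In a learned consistency model the closed form is replaced by a neural network trained to satisfy the same property, and the boundary condition (the function is the identity at `t = EPS`) is enforced by the parameterization.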

