CV · Dec 17, 2023

VidToMe: Video Token Merging for Zero-Shot Video Editing

arXiv:2312.10656v2 · 120 citations · h-index: 9 · CVPR
Originality: Incremental advance
AI Analysis

The paper addresses temporal consistency and memory consumption in zero-shot video editing, and is rated an incremental improvement in the field.

The paper tackles the problem of maintaining temporal consistency and reducing memory consumption in zero-shot video editing using pre-trained image diffusion models, achieving improved temporal coherence and efficiency over state-of-the-art methods.

Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods.
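The intra-chunk merging step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the choice of the first frame as the merge destination, cosine similarity as the matching criterion, averaging as the merge rule, and the names `merge_chunk`, `unmerge`, and `ratio` are all assumptions made for illustration. Only local (within-chunk) merging is shown; inter-chunk global merging is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_chunk(frames, ratio=0.5):
    """Merge temporally redundant tokens within one chunk of frames.

    frames: list of frames, each a list of token vectors.
    ratio:  fraction of non-destination tokens to merge (hypothetical knob).
    Returns (merged_tokens, mapping), where mapping[(frame, token)] is the
    index of the merged token that represents that original token.
    """
    dst = frames[0]  # assumption: the first frame holds the destination tokens
    merged = [list(t) for t in dst]
    counts = [1] * len(dst)
    mapping = {(0, j): j for j in range(len(dst))}

    # Match every token in the other frames to its most similar destination token.
    matches = []
    for f, frame in enumerate(frames[1:], start=1):
        for i, tok in enumerate(frame):
            sims = [cosine(tok, d) for d in dst]
            j = max(range(len(sims)), key=sims.__getitem__)
            matches.append((sims[j], f, i, j))

    # Merge the most similar matches; keep the remaining tokens unmerged.
    matches.sort(key=lambda m: m[0], reverse=True)
    n_merge = int(len(matches) * ratio)
    for rank, (_, f, i, j) in enumerate(matches):
        if rank < n_merge:
            c = counts[j]
            # Running mean of all tokens merged into destination j so far.
            merged[j] = [(m * c + x) / (c + 1) for m, x in zip(merged[j], frames[f][i])]
            counts[j] += 1
            mapping[(f, i)] = j
        else:
            mapping[(f, i)] = len(merged)
            merged.append(list(frames[f][i]))
    return merged, mapping


def unmerge(merged, mapping, n_frames, n_tokens):
    """Restore the per-frame layout; merged tokens are shared across frames."""
    return [[merged[mapping[(f, i)]] for i in range(n_tokens)] for f in range(n_frames)]
```

In this sketch, self-attention would run on the shorter `merged` sequence, which is where the memory saving comes from, and `unmerge` then copies each shared token back to every frame position it represents, so matched positions receive identical outputs across frames.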

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
