CV | Aug 17, 2023

Edit Temporal-Consistent Videos with Image Diffusion Model

arXiv:2308.09091v2 | 18 citations | h-index: 55
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of keeping edited videos temporally consistent in text-guided video editing, and represents an incremental improvement over existing methods.

The paper tackles temporal inconsistency in text-guided video editing by proposing a Temporal-Consistent Video Editing (TCVE) method. The authors report state-of-the-art results in both video temporal consistency and video editing capability.

Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this paper, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field.
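The abstract reports quantitative temporal-consistency results but does not say here how consistency is measured. As a minimal sketch of one common proxy (an assumption, not necessarily the metric used in the paper), the score below averages the cosine similarity between embeddings of consecutive frames. The function names are hypothetical, and the per-frame embeddings are assumed to come from some image encoder supplied by the caller.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def frame_consistency(frame_embeddings: List[Sequence[float]]) -> float:
    """Mean cosine similarity over consecutive frame pairs.

    Hypothetical helper: `frame_embeddings` holds one embedding per frame,
    in temporal order. Higher values mean neighbouring frames look more alike.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    sims = [
        cosine_similarity(prev, nxt)
        for prev, nxt in zip(frame_embeddings, frame_embeddings[1:])
    ]
    return sum(sims) / len(sims)
```

A score near 1 means adjacent frames have nearly identical embeddings. A video whose content flickers between frames scores lower. The score says nothing about whether the edit matches the text prompt, so it would need to be paired with a separate measure of editing quality.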

Code Implementations: 1 repo
