Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion
This work addresses temporal inconsistency in video editing with pre-trained diffusion models and offers a practical solution with improved fidelity and coherence, although the contribution is incremental because it builds on existing adapter and inversion methods.
The paper tackles poor temporal consistency in low-cost video editing with text-to-image diffusion models by proposing a General and Efficient Adapter (GE-Adapter) with temporal-spatial and semantic consistency components, and reports significant improvements in perceptual quality, text-image alignment, and temporal coherence on the MSR-VTT dataset.
Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, because T2I models generate each frame independently, the edited videos often have poor temporal consistency. Existing methods address this issue through temporal layer fine-tuning or inference-based temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion. This framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks) to capture frame-specific features and enforce smooth inter-frame transitions via temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks) employing bilateral filters to enhance spatial coherence by reducing noise and artifacts; and (3) a Token-based Semantic Consistency Module (TSC Module) to maintain semantic alignment using shared prompt tokens and frame-specific tokens. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves enhanced fidelity and frame-to-frame coherence, offering a practical solution for T2V editing.
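To make two of the ingredients named above more concrete, the following is a minimal sketch in plain Python, not the GE-Adapter implementation. It shows a textbook bilateral filter, the edge-preserving smoothing operation that the SCD Blocks are said to employ, and one simple possibility for a temporally-aware loss. The abstract does not specify the loss, so the mean squared difference between consecutive frames used here is an assumption, as are the function names, the parameters (`radius`, `sigma_space`, `sigma_range`), and the representation of a frame as a list of rows of grayscale floats.

```python
import math


def bilateral_filter(image, radius=1, sigma_space=1.0, sigma_range=0.1):
    """Textbook bilateral filter on a 2D grayscale image (list of rows).

    Each output pixel is a weighted mean of its neighbours. The weight
    combines spatial closeness with intensity similarity, so smoothing
    is suppressed across strong edges.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            weighted_sum = 0.0
            weight_total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue  # ignore neighbours outside the image
                    neighbour = image[ny][nx]
                    spatial = math.exp(
                        -(dx * dx + dy * dy) / (2 * sigma_space ** 2)
                    )
                    tonal = math.exp(
                        -((neighbour - center) ** 2) / (2 * sigma_range ** 2)
                    )
                    weighted_sum += spatial * tonal * neighbour
                    weight_total += spatial * tonal
            # The centre pixel always contributes weight 1, so the
            # denominator is never zero.
            output[y][x] = weighted_sum / weight_total
    return output


def temporal_consistency_loss(frames):
    """Assumed loss: mean squared difference between consecutive frames."""
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += (a - b) ** 2
                count += 1
    return total / count
```

In this sketch, a small `sigma_range` keeps a sharp step edge almost unchanged while flat regions are averaged, and the loss is zero exactly when all frames are identical. A practical adapter would apply such operations to feature maps with a tensor library rather than to nested lists.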