Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts
This work addresses efficient and coherent video editing for users of multimedia and AI applications. It is an incremental contribution: it builds on existing diffusion-based techniques and adds novel components.
The authors tackle zero-shot, training-free video editing conditioned on both images and text, and report a method that outperforms state-of-the-art methods across all metrics.
We propose ImEdit, the first zero-shot, training-free video editing method conditioned on both images and text. The proposed method introduces $ρ$-start sampling and dilated dual masking to construct well-structured noise maps for coherent and accurate edits. We further present zero image guidance, a controllable negative prompt strategy, for visual fidelity. Both quantitative and qualitative evaluations show that our method outperforms state-of-the-art methods across all metrics.
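The abstract describes zero image guidance only as a controllable negative prompt strategy and gives no formula. As background, the following is a minimal sketch of generic classifier-free guidance with a negative condition, the standard mechanism that negative prompts rely on in diffusion samplers. It is not ImEdit's implementation: the function name `guided_noise`, the `scale` parameter, and the toy values are illustrative assumptions, and how ImEdit constructs or controls its negative condition is not stated in this text.

```python
from typing import List, Sequence


def guided_noise(
    eps_cond: Sequence[float],
    eps_neg: Sequence[float],
    scale: float,
) -> List[float]:
    """Generic classifier-free guidance with a negative condition.

    eps_cond: noise predicted under the desired (positive) condition.
    eps_neg:  noise predicted under the negative condition.
    scale:    guidance strength; 1.0 reproduces eps_cond and
              0.0 reproduces eps_neg (up to floating-point rounding).
    """
    if len(eps_cond) != len(eps_neg):
        raise ValueError("predictions must have the same length")
    # Start from the negative prediction and move toward (or, for
    # scale > 1, past) the positive prediction.
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


# Toy values standing in for flattened noise predictions (hypothetical).
eps_cond = [1.0, 2.0, 3.0]
eps_neg = [0.0, 0.0, 1.0]
guided = guided_noise(eps_cond, eps_neg, scale=2.0)
```

In this generic form, raising `scale` pushes the prediction further from whatever the negative condition encodes, which is why the choice of negative condition affects the visual quality of the result.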