CV AI, Dec 4, 2024

DIVE: Taming DINO for Subject-Driven Video Editing

Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen

arXiv:2412.03347v2 15.316 citations h-index: 9

Originality: Incremental advance

AI Analysis

This work addresses video editing challenges relevant to AI and multimedia applications. It is an incremental improvement that builds on existing components such as the DINOv2 model and Low-Rank Adaptations (LoRAs).

The paper tackles the challenge of maintaining temporal consistency and motion alignment in subject-driven video editing. It proposes DIVE, a framework that uses DINOv2 features to guide the editing process. In experiments on diverse real-world videos, the authors report high-quality editing results with robust motion consistency.

Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject's identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: https://dino-video-editing.github.io
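
To make the phrase "implicit correspondences" more concrete, here is a minimal sketch of one common way semantic features can link patches across frames: nearest-neighbor matching under cosine similarity. This is an illustration only and is not taken from the paper; DIVE's actual alignment procedure is not described on this page. The function names (`cosine_similarity`, `match_patches`) and the toy 3-dimensional vectors are hypothetical stand-ins for real DINOv2 patch features, which would be much higher-dimensional.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_patches(source_feats: Sequence[Vector],
                  target_feats: Sequence[Vector]) -> List[int]:
    """For each source patch, return the index of the most similar target patch."""
    matches = []
    for s in source_feats:
        sims = [cosine_similarity(s, t) for t in target_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


# Toy per-patch features (assumption: stand-ins for DINOv2 patch embeddings).
# frame_b holds roughly the same three "patches" as frame_a, in a shifted order.
frame_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_b = [[0.0, 0.1, 0.9], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

matches = match_patches(frame_a, frame_b)
print(matches)  # [1, 2, 0]
```

In this toy case, each patch in `frame_a` is matched to the patch in `frame_b` that points in nearly the same feature direction, so the matches recover how the content moved between the two frames. Cosine similarity is chosen because it ignores feature magnitude, which is a common choice for comparing embeddings but an assumption here.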
