CV · May 14

MiVE: Multiscale Vision-language features for reference-guided video Editing

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

arXiv:2605.14664 · 94.6

AI Analysis

For video editing tasks requiring precise alignment of text instructions and reference images, MiVE provides a new paradigm that eliminates modality gaps and preserves fine-grained spatial details.

MiVE addresses reference-guided video editing by extracting hierarchical features from a VLM (Qwen3-VL) and integrating them into a unified self-attention Diffusion Transformer. In the reported experiments it ranks highest in human preference against both academic methods and commercial systems.

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.
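
The core idea of the abstract, taking features from several VLM depths and letting video tokens attend to them in one joint self-attention pass, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not say which layers are used, how the features are fused, or how the Diffusion Transformer is built, so the layer indices, the token-axis concatenation, the identity Q/K/V projections, and all function names (`select_layers`, `concat_tokens`, `self_attention`) are hypothetical and not the authors' implementation.

```python
import math

def select_layers(hidden_states, layer_ids):
    """Pick per-layer token features (hypothetical choice of early/mid/deep layers)."""
    return [hidden_states[i] for i in layer_ids]

def concat_tokens(*token_groups):
    """Concatenate token sequences along the sequence axis."""
    out = []
    for group in token_groups:
        out.extend(group)
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections (an assumption)."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

# Stand-in for VLM hidden states: 6 layers x 3 tokens x 4 channels (made-up numbers).
hidden_states = [
    [[float(layer + tok + c) for c in range(4)] for tok in range(3)]
    for layer in range(6)
]
# Stand-in for noisy video latent tokens: 5 tokens x 4 channels.
video_tokens = [[0.1 * (t + c) for c in range(4)] for t in range(5)]

# Early layer for spatial detail, deep layer for semantics (indices are hypothetical).
multiscale = select_layers(hidden_states, layer_ids=[1, 3, 5])
condition_tokens = concat_tokens(*multiscale)

# One joint sequence: video and condition tokens share a single self-attention,
# rather than video tokens cross-attending to a separate condition stream.
joint = concat_tokens(video_tokens, condition_tokens)
attended = self_attention(joint)
video_out = attended[:len(video_tokens)]
```

In this sketch the conditioning features and the video tokens share one sequence, so every video token can attend to both shallow and deep VLM features in the same attention operation. A real model would add learned projections, positional encodings, and multiple heads and layers.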
