CV · May 14

MiVE: Multiscale Vision-language features for reference-guided video Editing

arXiv:2605.14664 · 94.6
AI Analysis

For video editing tasks requiring precise alignment of text instructions and reference images, MiVE provides a new paradigm that eliminates modality gaps and preserves fine-grained spatial details.

MiVE addresses reference-guided video editing by extracting hierarchical features from a VLM (Qwen3-VL) and integrating them into a unified self-attention Diffusion Transformer. It reports state-of-the-art performance, ranking highest in human preference over both academic and commercial methods.

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.
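The abstract does not spell out how the multiscale features are fused, so the following is only a minimal sketch of the general idea under stated assumptions. It uses random arrays in place of real Qwen3-VL hidden states, picks three hypothetical layer indices (early, middle, final), projects each to an assumed transformer width, and runs one single-head self-attention pass over video and conditioning tokens concatenated into one sequence. All names, sizes, and the choice of layers are illustrative and are not taken from the paper.

```python
import numpy as np

def select_multiscale(hidden_states, layer_ids):
    """Pick hidden states from several depths of a (stand-in) VLM.

    hidden_states: array of shape (num_layers, num_tokens, vlm_dim)
    layer_ids: indices of the layers to keep (hypothetical choice)
    """
    return [hidden_states[i] for i in layer_ids]

def project(features, weight):
    # Linear map from VLM width to transformer width (an assumption).
    return features @ weight

def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over one concatenated token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_layers, cond_tokens, vlm_dim = 12, 6, 32   # illustrative sizes
dit_dim, video_tokens = 16, 10                 # illustrative sizes

# Random stand-in for per-layer VLM outputs over instruction + reference tokens.
hidden_states = rng.standard_normal((num_layers, cond_tokens, vlm_dim))
layer_ids = [2, 6, 11]  # early, middle, final (hypothetical)
selected = select_multiscale(hidden_states, layer_ids)

# One projection per selected layer, then stack along the token axis.
proj_weights = [
    rng.standard_normal((vlm_dim, dit_dim)) / np.sqrt(vlm_dim)
    for _ in layer_ids
]
cond = np.concatenate(
    [project(f, w) for f, w in zip(selected, proj_weights)], axis=0
)

# Video latent tokens and conditioning tokens share one sequence, so every
# token attends to every other token in a single self-attention pass.
video = rng.standard_normal((video_tokens, dit_dim))
sequence = np.concatenate([video, cond], axis=0)

w_q, w_k, w_v = (
    rng.standard_normal((dit_dim, dit_dim)) / np.sqrt(dit_dim)
    for _ in range(3)
)
out, attn = joint_self_attention(sequence, w_q, w_k, w_v)
video_out = out[:video_tokens]  # updated video tokens only
```

Because the conditioning tokens sit in the same sequence as the video tokens, the sketch needs no separate cross-attention module, which is the contrast the abstract draws. A real model would add multi-head attention, positional information, and learned weights.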

