CV | AI | Dec 18, 2025

EasyV2V: A High-quality Instruction-based Video Editing Framework

arXiv:2512.16920v1 | 1 citation | h-index: 18
Originality: Incremental advance
AI Analysis

This work addresses instruction-based video editing for users who need flexible control, but it appears incremental because it builds on existing pretrained models and data composition techniques.

The authors tackle instruction-based video editing, which faces challenges in consistency, control, and generalization. They introduce EasyV2V, a framework that reports state-of-the-art results, surpassing concurrent and commercial systems.

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
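
The abstract mentions lifting image edit pairs into video pairs through "pseudo pairs with shared affine motion". The sketch below illustrates one plausible reading of that idea, not the authors' pipeline: the same synthetic affine trajectory (here a slow zoom plus a pan) is applied to both the source image and its edited counterpart, so the two resulting clips share identical motion and differ only by the edit. The function names, the linear zoom-and-pan trajectory, and the nearest-neighbor warping are all assumptions made for illustration.

```python
import numpy as np


def affine_trajectory(num_frames, end_scale=1.2, end_shift=(8.0, -4.0)):
    """Hypothetical trajectory: 2x3 affine matrices interpolated linearly
    from the identity (frame 0) to a zoomed and shifted end pose."""
    mats = []
    for t in np.linspace(0.0, 1.0, num_frames):
        s = 1.0 + t * (end_scale - 1.0)
        dx, dy = t * end_shift[0], t * end_shift[1]
        mats.append(np.array([[s, 0.0, dx], [0.0, s, dy]]))
    return mats


def warp(image, mat):
    """Warp an (H, W, C) image with a 2x3 affine matrix about the image
    center, using inverse mapping and nearest-neighbor sampling."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    inv = np.linalg.inv(mat[:, :2])
    # Forward model: out = A (in - c) + b + c, so in = A^-1 (out - c - b) + c.
    coords = np.stack([xs - cx - mat[0, 2], ys - cy - mat[1, 2]], axis=-1)
    src = coords @ inv.T
    sx = np.clip(np.rint(src[..., 0] + cx), 0, w - 1).astype(int)
    sy = np.clip(np.rint(src[..., 1] + cy), 0, h - 1).astype(int)
    return image[sy, sx]


def make_pseudo_video_pair(src_img, edit_img, num_frames=8):
    """Apply one shared affine trajectory to both images of an edit pair."""
    mats = affine_trajectory(num_frames)
    src_video = np.stack([warp(src_img, m) for m in mats])
    edit_video = np.stack([warp(edit_img, m) for m in mats])
    return src_video, edit_video
```

Because both clips are sampled at the same pixel locations in every frame, any per-pixel relationship between the source image and the edited image carries over unchanged to every frame of the pseudo video pair.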

