OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing
This work addresses a gap for researchers and practitioners in video editing by providing a large-scale dataset and a benchmark, though the contribution is arguably incremental, since it builds on existing image editing datasets.
The authors tackle the scarcity of large-scale, high-quality datasets for instruction-based video editing by introducing OpenVE-3M, a dataset with diverse edit types and rigorous quality filtering. Their trained model, OpenVE-Edit (5B parameters), sets a new state of the art on the OpenVE-Bench benchmark, outperforming prior open-source models including a 14B baseline.
The quality and diversity of instruction-based image editing datasets are continuously increasing, yet large-scale, high-quality datasets for instruction-based video editing remain scarce. To address this gap, we introduce OpenVE-3M, an open-source, large-scale, and high-quality dataset for instruction-based video editing. It comprises two primary categories: spatially-aligned edits (Global Style, Background Change, Local Change, Local Remove, Local Add, and Subtitles Edit) and non-spatially-aligned edits (Camera Multi-Shot Edit and Creative Edit). All edit types are generated via a meticulously designed data pipeline with rigorous quality filtering. OpenVE-3M surpasses existing open-source datasets in terms of scale, diversity of edit types, instruction length, and overall quality. Furthermore, to address the lack of a unified benchmark in the field, we construct OpenVE-Bench, which contains 431 video-edit pairs covering a diverse range of editing tasks, together with three key metrics that are highly aligned with human judgment. We present OpenVE-Edit, a 5B model trained on our dataset that demonstrates remarkable efficiency and effectiveness by setting a new state of the art on OpenVE-Bench, outperforming all prior open-source models including a 14B baseline. The project page is at https://github.com/lewandofskee/OpenVE.
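The two-level taxonomy of edit types described above can be sketched as a simple lookup table. This is only an illustration: the category keys, the `EDIT_TAXONOMY` name, and the `category_of` helper are assumptions made for this sketch and are not taken from the released dataset or its code.

```python
# Hypothetical representation of the OpenVE-3M edit taxonomy.
# The edit-type names come from the abstract; the dictionary keys and
# function names are illustrative assumptions, not the dataset's schema.
EDIT_TAXONOMY = {
    "spatially_aligned": [
        "Global Style",
        "Background Change",
        "Local Change",
        "Local Remove",
        "Local Add",
        "Subtitles Edit",
    ],
    "non_spatially_aligned": [
        "Camera Multi-Shot Edit",
        "Creative Edit",
    ],
}


def category_of(edit_type: str) -> str:
    """Return the primary category that a given edit type belongs to."""
    for category, edit_types in EDIT_TAXONOMY.items():
        if edit_type in edit_types:
            return category
    raise ValueError(f"Unknown edit type: {edit_type!r}")


def all_edit_types() -> list[str]:
    """Flatten the taxonomy into a single list of edit types."""
    return [t for types in EDIT_TAXONOMY.values() for t in types]
```

A call such as `category_of("Local Remove")` would return the spatially-aligned category, which is the kind of grouping one might use when reporting per-category results on a benchmark.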