CV | Jan 18, 2024

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

arXiv:2401.10226v2 | 38 citations | CVPR
Originality: Highly original
AI Analysis

The work targets the tedious, labor-intensive manual mask labeling that video inpainting currently demands of computer vision researchers and practitioners.

The paper replaces the manual binary masks used in video inpainting with natural language instructions. It contributes the ROVI dataset, containing 5,650 videos and 9,091 inpainting results, and a diffusion-based framework presented as the first end-to-end baseline for the task.

We introduce a new task -- language-driven video inpainting, which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks, a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset, containing 5,650 videos and 9,091 inpainting results, to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework, the first end-to-end baseline for this task, integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We will make datasets, code, and models publicly available.
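To make the difference in inputs concrete, here is a minimal sketch, not the paper's method or the ROVI data format. The class names `MaskDrivenRequest` and `LanguageDrivenRequest` and the function `temporal_median_fill` are hypothetical, introduced only for illustration. The mask-driven request needs a hand-labeled mask for every frame, which a naive temporal-median baseline then fills; the language-driven request carries only the frames and an instruction, and a model would have to work out what to remove on its own.

```python
from dataclasses import dataclass
from statistics import median
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel values
Mask = List[List[int]]     # 1 marks a pixel to remove, 0 keeps it


@dataclass
class MaskDrivenRequest:
    """Traditional input (hypothetical type): one hand-labeled mask per frame."""
    frames: List[Frame]
    masks: List[Mask]


@dataclass
class LanguageDrivenRequest:
    """Language-driven input (hypothetical type): no masks, only an instruction."""
    frames: List[Frame]
    instruction: str  # e.g. "remove the dog on the left"


def temporal_median_fill(req: MaskDrivenRequest) -> List[Frame]:
    """Naive mask-driven baseline, assumed here purely for illustration.

    Each masked pixel is replaced by the median of the values seen at the
    same location in frames where that location is not masked.
    """
    height, width = len(req.frames[0]), len(req.frames[0][0])
    out = [[row[:] for row in frame] for frame in req.frames]
    for y in range(height):
        for x in range(width):
            visible = [f[y][x] for f, m in zip(req.frames, req.masks)
                       if m[y][x] == 0]
            if not visible:
                continue  # location is masked in every frame: leave as-is
            fill = median(visible)
            for t, m in enumerate(req.masks):
                if m[y][x] == 1:
                    out[t][y][x] = fill
    return out


# Three 1x2 frames; an "object" (value 99) covers pixel (0, 0) in frame 0 only.
frames = [[[99, 10]], [[10, 10]], [[10, 10]]]
masks = [[[1, 0]], [[0, 0]], [[0, 0]]]
filled = temporal_median_fill(MaskDrivenRequest(frames, masks))

# The language-driven counterpart of the same request carries no mask at all.
request = LanguageDrivenRequest(frames, "remove the bright object")
```

The point of the sketch is the shape of the two requests: the mask labeling that the first one requires is exactly the manual step the language-driven task is meant to remove.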

Code Implementations: 1 repo
