VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion
This work addresses the need for temporally coherent, controllable video matting in video editing and computer vision applications. It is incremental in that it builds on existing video diffusion models.
The authors tackle the problem of generating alpha mattes for instances in a video that are specified by a text caption. Their method combines a video diffusion model with a new loss function, and they report state-of-the-art results supported by extensive experiments.
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.
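The abstract does not define the Latent-Constructive loss, so the following is only a toy illustration of how a loss in latent space can separate instances. It assumes the loss can be read as a contrastive-style objective and uses a generic InfoNCE form with cosine similarity. The function names, the `temperature` parameter, and the two-dimensional example latents are all hypothetical and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the authors' formulation).

    anchor:    latent of the predicted matte for the referred instance
    positive:  latent of the ground-truth matte for the same instance
    negatives: latents of mattes belonging to other instances in the video
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    # The loss is small when the anchor is close to the positive
    # and far from every negative.
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D latents: the anchor matches instance A, not instance B.
instance_a = [1.0, 0.0]
instance_b = [0.0, 1.0]

loss_correct = contrastive_loss(instance_a, instance_a, [instance_b])
loss_confused = contrastive_loss(instance_a, instance_b, [instance_a])
```

In this sketch the loss is near zero when the predicted latent aligns with the referred instance, and large when it aligns with a different instance. That is the kind of signal a loss meant to distinguish instances would need to provide.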