CV | Aug 19, 2024

UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong

arXiv:2408.10129v2 | 6.54 citations | h-index: 7

Originality: Synthesis-oriented

AI Analysis

This is an incremental solution for researchers in video segmentation, addressing the specific challenge of motion-based referring segmentation in videos.

The authors tackled the referring video object segmentation (RVOS) task on the new MeViS benchmark, which uses motion descriptions instead of static attributes, by integrating existing RVOS and VOS models with semi-supervised learning, achieving 62.57 J&F and first place in the LSVOS Challenge RVOS Track.

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS refers to the target object in a video through descriptions of its motion rather than its static attributes, which makes the RVOS task more challenging. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we fine-tune a state-of-the-art RVOS model to obtain mask sequences that are correlated with the language descriptions. Second, starting from reliable, high-quality key frames, we use a VOS model to improve the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.
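The second step of the pipeline (key frames from the RVOS model, refinement by a VOS model) can be illustrated with a minimal sketch. The abstract does not say how key frames are chosen or how propagation is run, so everything here is an assumption made for illustration: a single key frame picked by the highest per-frame confidence score, bidirectional propagation from that frame, and the names `select_key_frame`, `refine_with_vos`, and `copy_propagate`, none of which come from the paper. The toy propagator simply copies the mask and stands in for a real VOS model.

```python
from typing import Callable, List, Optional, Sequence

# Binary mask as nested lists (0 = background, 1 = object); an illustrative type only.
Mask = List[List[int]]


def select_key_frame(scores: Sequence[float]) -> int:
    """Return the index of the frame with the highest confidence score.

    Assumption: the RVOS model exposes one confidence score per frame.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def refine_with_vos(
    rvos_masks: Sequence[Mask],
    scores: Sequence[float],
    propagate: Callable[[Mask, int, int], Mask],
) -> List[Mask]:
    """Keep the key-frame mask and re-derive all other masks by propagation.

    `propagate(mask, src, dst)` is a hypothetical stand-in for a VOS model
    that carries `mask` from frame `src` to frame `dst`.
    """
    if len(rvos_masks) != len(scores):
        raise ValueError("need exactly one score per mask")

    key = select_key_frame(scores)
    refined: List[Optional[Mask]] = [None] * len(rvos_masks)
    refined[key] = rvos_masks[key]

    # Propagate forward in time from the key frame.
    for t in range(key + 1, len(rvos_masks)):
        refined[t] = propagate(refined[t - 1], t - 1, t)

    # Propagate backward in time from the key frame.
    for t in range(key - 1, -1, -1):
        refined[t] = propagate(refined[t + 1], t + 1, t)

    return [m for m in refined if m is not None]


def copy_propagate(mask: Mask, src: int, dst: int) -> Mask:
    """Toy propagator: assumes a static object and copies the mask unchanged."""
    return [row[:] for row in mask]


# Three one-row frames; the middle frame has the most confident RVOS prediction.
masks = [[[0, 0]], [[1, 0]], [[0, 1]]]
scores = [0.2, 0.9, 0.4]
refined = refine_with_vos(masks, scores, copy_propagate)
```

In this toy run the middle frame is selected as the key frame, and the less confident masks on either side are overwritten by the propagated one, which is the temporal-consistency effect the abstract describes. A real system would swap `copy_propagate` for an actual VOS model.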
