2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC
This report addresses robustness issues in video object segmentation, which matter for applications such as video editing, but its contribution is incremental: it applies an existing method to a new dataset.
The paper tackles semi-supervised video object segmentation by evaluating the zero-shot performance of the Segment Concept (SeC) framework on the MOSEv2 dataset, where it achieves 39.7 JFn and ranks 2nd in the challenge.
Semi-supervised Video Object Segmentation (VOS) aims to segment a specified target throughout a video sequence, given its mask in the first frame. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness to challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigates this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object, enabling more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved 39.7 JFn on the test set and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.