CVSep 19, 2025

Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

arXiv:2509.15781v1h-index: 1Has Code
Originality Synthesis-oriented
AI Analysis

This work addresses robust video object segmentation for applications like video editing and autonomous driving, but it is incremental as it combines existing methods.

The authors tackled video object segmentation by integrating the encoders of Cutie and SAM2 with a motion prediction module, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge.

Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes