CV · Apr 1

The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

arXiv:2604.00404 · 79.2 · h-index: 9 · Has Code
Predicted impact: top 29% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses segmenting objects in videos from complex language descriptions, a problem of interest to computer vision researchers. It is an incremental advance: it builds on existing models without introducing new training methods.

The authors tackled referring video object segmentation under motion-centric language expressions by developing a training-free pipeline that combines multimodal large language models with SAM3, achieving first place on the PVUW 2026 MeViS-Text test set with a Final score of 0.909064 and a J&F score of 0.7897.
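For readers unfamiliar with the J&F number: in video object segmentation benchmarks it is conventionally the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below shows J and the averaging step only. It is an illustration under that convention, not the challenge's evaluation code; the function names are hypothetical, F is taken as a given input, and the formula behind the Final score is not stated in the source and is not reproduced here.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask stored as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (IoU) between two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
```

In practice J and F are computed per frame and per object and then averaged over the dataset; the toy masks here only show the per-frame arithmetic.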

This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
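The three stages described in the abstract can be sketched as a small orchestration function. This is a minimal illustration, not the authors' implementation: every name below (`GroundingTarget`, `run_pipeline`, and the four callables) is hypothetical, the model calls are passed in as plain functions rather than real Gemini, SAM3, or Qwen APIs, and the refinement stage is simplified to a keep-or-drop check, whereas the report describes correcting predictions.

```python
from dataclasses import dataclass
from typing import Callable, List

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


@dataclass
class GroundingTarget:
    frame_index: int   # frame where the target is most clearly visible
    description: str   # discriminative description of one instance


def run_pipeline(
    frames: List[object],
    expression: str,
    decompose: Callable[[List[object], str], List[GroundingTarget]],
    seed_mask: Callable[[object, str], Mask],
    propagate: Callable[[List[object], int, Mask], List[Mask]],
    verify: Callable[[List[object], str, List[Mask]], bool],
) -> List[List[Mask]]:
    """Return one mask track per instance that passes verification."""
    kept = []
    # Stage 1: an MLLM splits the expression into instance-level targets.
    for target in decompose(frames, expression):
        # Stage 2: seed mask on the selected frame, then propagation.
        seed = seed_mask(frames[target.frame_index], target.description)
        track = propagate(frames, target.frame_index, seed)
        # Stage 3 (simplified): drop tracks that fail verification.
        if verify(frames, expression, track):
            kept.append(track)
    return kept


# Stub components so the sketch runs without any model.
def stub_decompose(frames, expression):
    return [GroundingTarget(1, "left dog"), GroundingTarget(0, "right dog")]


def stub_seed_mask(frame, description):
    return [[1]] if "left" in description else [[0]]


def stub_propagate(frames, frame_index, seed):
    return [seed for _ in frames]


def stub_verify(frames, expression, track):
    return any(pixel for mask in track for row in mask for pixel in row)


frames = ["frame0", "frame1", "frame2"]
tracks = run_pipeline(frames, "the dog that runs to the left",
                      stub_decompose, stub_seed_mask,
                      stub_propagate, stub_verify)
```

Passing the stages in as callables keeps the control flow visible and lets each model be swapped or mocked independently, which matches the training-free framing: no stage here updates any weights.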

