CV · Mar 28

SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track

arXiv:2603.27241 · 72.4 · h-index: 16
AI Analysis

For researchers in video object segmentation, this work provides a simple yet effective verification strategy that improves performance on motion-centric referring tasks, though it is an incremental improvement over an existing method.

The authors extended the SaSaSa2VA model with a target existence-aware verification mechanism for motion-centric referring video object segmentation, achieving a final score of 89.19 and 2nd place in the 5th PVUW MeViS-Text Track.

Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring and reasoning motion expressions) and introducing no-target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.
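
The abstract does not describe how the verification step works internally. One plausible reading, given that MeViS includes no-target queries, is that predicted masks are kept only when a separate check decides the referred target is actually present. The sketch below illustrates that gating idea only. The names `verify_and_gate`, `target_exists`, and `toy_verifier` are hypothetical, and the toy verifier is a placeholder, not the authors' mechanism.

```python
from typing import Callable, List

# One binary mask per frame, stored as rows of 0/1 values.
Mask = List[List[int]]


def empty_like(mask: Mask) -> Mask:
    """Return an all-zero mask with the same shape as `mask`."""
    return [[0] * len(row) for row in mask]


def verify_and_gate(
    masks: List[Mask],
    expression: str,
    target_exists: Callable[[str], bool],
) -> List[Mask]:
    """Keep the predicted masks only if the verifier says the target exists.

    `target_exists` is an assumed callback; in a real system it might be a
    model query, but nothing here reflects the paper's actual design.
    """
    if target_exists(expression):
        return masks
    # No-target query: emit an empty mask for every frame.
    return [empty_like(m) for m in masks]


def toy_verifier(expression: str) -> bool:
    """Placeholder existence check, purely for illustration."""
    return "unicorn" not in expression


# Two frames of 2x2 predicted masks.
predicted = [[[0, 1], [1, 1]], [[1, 1], [0, 0]]]
kept = verify_and_gate(predicted, "the dog running left", toy_verifier)
blanked = verify_and_gate(predicted, "the unicorn jumping", toy_verifier)
```

Here `kept` is the unchanged prediction, while `blanked` holds all-zero masks for both frames, which is how a no-target query would typically be answered in a segmentation benchmark.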
