CV · Sep 23, 2025

Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

arXiv:2509.19082v2 · 1 citation · h-index: 19 · Has Code
AI Analysis

This work addresses a specific bottleneck in referring video segmentation that is relevant to researchers and practitioners, though the contribution is incremental because it builds on an existing model.

The authors tackled inconsistencies between training and inference in Sa2VA, a model for language-guided dense grounding, and improved its performance on referring video object segmentation tasks, achieving up to +11.6 J&F on MeViS and setting new state-of-the-art results on multiple benchmarks.

Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and has become widely popular. However, we found that Sa2VA does not reach its full potential on referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To address this, we propose Sa2VA-i, an improved version of Sa2VA that rectifies these inconsistencies and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i

