CV Sep 28, 2025

SVAC: Scaling Is All You Need For Referring Video Object Segmentation

Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang

arXiv:2509.24109v16.21 citations h-index: 1 Has Code

Originality: Incremental advance

AI Analysis

This work addresses RVOS challenges relevant to video analysis applications, but the advance is incremental: it builds on existing MLLM-based methods and adds optimizations for scaling and efficiency.

The paper tackles the problem of Referring Video Object Segmentation (RVOS) by proposing SVAC, a model that scales up input frames and segmentation tokens to improve video-language interaction and segmentation precision, achieving state-of-the-art performance on multiple benchmarks with competitive efficiency.

Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.
