Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder
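This work targets a specific bottleneck in video understanding: the entanglement of visual and semantic features in segmentation and grounding models. It is aimed at researchers and practitioners in that area and offers incremental improvements through feature decoupling.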
The paper tackles entangled visual and semantic features in video segmentation and grounding models, which degrade accuracy. It proposes DeSa2VA, a decoupling-enhanced prompting scheme that achieves state-of-the-art performance across multiple tasks, including image and video segmentation and question answering.
Existing video segmentation and grounding approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme that integrates text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. First, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our code is available at https://github.com/longmalongma/DeSa2VA.
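The decoupling and fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the DeSa2VA implementation: all function names are hypothetical, and the abstract does not specify the terms of the hybrid loss (assumed here to be binary cross-entropy plus Dice) or the fusion rule (assumed here to be a convex combination with a fixed weight `alpha`, whereas the paper describes a dynamic strategy). It uses plain Python lists so that it runs without external libraries.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a feature vector given as a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_projection(out_dim, in_dim, rng):
    """Hypothetical initializer: small random weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def decouple(hidden, text_proj, visual_proj):
    """Project one hidden state into separate textual and visual features."""
    return linear(hidden, *text_proj), linear(hidden, *visual_proj)


def flatten(mask):
    return [p for row in mask for p in row]


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels (assumed loss term)."""
    pairs = list(zip(flatten(pred), flatten(target)))
    total = 0.0
    for p, t in pairs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pairs)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss with additive smoothing (assumed loss term)."""
    p, t = flatten(pred), flatten(target)
    inter = sum(pi * ti for pi, ti in zip(p, t))
    return 1.0 - (2.0 * inter + smooth) / (sum(p) + sum(t) + smooth)


def hybrid_loss(pred, target):
    """Assumed hybrid loss: BCE + Dice with equal weights."""
    return bce_loss(pred, target) + dice_loss(pred, target)


def fuse_masks(text_mask, visual_mask, alpha):
    """Assumed fusion rule: per-pixel convex combination of the two masks."""
    return [[alpha * t + (1.0 - alpha) * v for t, v in zip(tr, vr)]
            for tr, vr in zip(text_mask, visual_mask)]


def triple_supervision_loss(text_mask, visual_mask, gt_mask, alpha=0.5):
    """Supervise the text mask, the visual mask, and their fusion against GT."""
    fused = fuse_masks(text_mask, visual_mask, alpha)
    return (hybrid_loss(text_mask, gt_mask)
            + hybrid_loss(visual_mask, gt_mask)
            + hybrid_loss(fused, gt_mask))


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for an LLM hidden state
    text_feat, visual_feat = decouple(
        hidden, init_projection(4, 8, rng), init_projection(4, 8, rng))
    print(len(text_feat), len(visual_feat))

    gt = [[1.0, 0.0], [0.0, 1.0]]
    text_mask = [[0.9, 0.2], [0.1, 0.7]]
    visual_mask = [[0.8, 0.1], [0.3, 0.9]]
    print(triple_supervision_loss(text_mask, visual_mask, gt))
```

In this sketch, the two projections share the same input but have independent weights, which is the simplest way to obtain separate textual and visual subspaces from one hidden state. In a real model the features would be decoded into masks by the segmentation backbone before any loss is computed.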