SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
This work addresses the challenge of cross-modal reasoning and fine-grained object localization in referring audio-visual segmentation, representing an incremental improvement over existing methods.
The paper tackles the problem of segmenting objects in videos based on natural language expressions involving audio, vision, and text, proposing SimToken, which integrates a multimodal large language model with the Segment Anything Model to achieve superior performance on the Ref-AVS benchmark.
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions that refer to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
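The abstract does not give the exact form of the target-consistent semantic alignment loss, so the following is only an illustrative sketch of the general idea: pulling together token embeddings whose expressions refer to the same object. It assumes a mean cosine-distance penalty over same-target pairs. The function name `target_consistent_alignment_loss`, the `(N, D)` embedding layout, and the integer `target_ids` are hypothetical and not taken from the paper.

```python
import numpy as np


def target_consistent_alignment_loss(tokens, target_ids, eps=1e-8):
    """Mean (1 - cosine similarity) over pairs of tokens sharing a target.

    Assumed formulation, not the paper's definition.
    tokens:     (N, D) array, one semantic-token embedding per expression.
    target_ids: (N,) array; equal ids mean "refers to the same object".
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    target_ids = np.asarray(target_ids)

    # L2-normalise each embedding so the dot product is cosine similarity.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, eps)
    sim = unit @ unit.T  # (N, N) pairwise cosine similarities

    # Select pairs with the same target; the strict upper triangle keeps
    # each unordered pair once and drops self-pairs.
    same_target = target_ids[:, None] == target_ids[None, :]
    pair_mask = np.triu(same_target, k=1)

    n_pairs = int(pair_mask.sum())
    if n_pairs == 0:
        return 0.0  # no two expressions share a target in this batch
    return float(((1.0 - sim) * pair_mask).sum() / n_pairs)
```

Under this assumed form, the loss is zero when all expressions for a given object map to embeddings pointing in the same direction, and it grows as those embeddings diverge. Pairs of expressions with different targets do not contribute.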