CV · Sep 22, 2025

SimToken: A Simple Baseline for Referring Audio-Visual Segmentation

arXiv:2509.17537v2 · 4 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses the challenge of cross-modal reasoning and fine-grained object localization in referring audio-visual segmentation, representing an incremental improvement over existing methods.

The paper tackles the problem of segmenting objects in videos based on natural language expressions involving audio, vision, and text, proposing SimToken, which integrates a multimodal large language model with the Segment Anything Model to achieve superior performance on the Ref-AVS benchmark.

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions that refer to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
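
The abstract does not give the form of the target-consistent semantic alignment loss, so the following is only an illustrative sketch of the general idea: token embeddings produced for different expressions that refer to the same object are pulled toward each other. The function names, the use of cosine similarity, and the toy data are all assumptions made for illustration, not the paper's formulation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def target_consistent_alignment_loss(token_embeddings, target_ids):
    """Hypothetical alignment loss (an assumption, not the paper's definition).

    Averages (1 - cosine similarity) over every pair of expressions that
    share the same target object. Returns 0.0 when no such pair exists.
    """
    pair_losses = [
        1.0 - cosine_similarity(token_embeddings[i], token_embeddings[j])
        for i, j in combinations(range(len(token_embeddings)), 2)
        if target_ids[i] == target_ids[j]
    ]
    if not pair_losses:
        return 0.0
    return sum(pair_losses) / len(pair_losses)


# Toy data: three expressions, the first two referring to the same object.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
targets = ["dog", "dog", "guitar"]
loss = target_consistent_alignment_loss(embeddings, targets)
```

In this toy case only the first two embeddings share a target, so only that pair contributes to the loss; pairs referring to different objects are ignored rather than pushed apart, which is one of several possible design choices.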

