CV · Mar 24

SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

arXiv:2603.22732 · 14.3 · h-index: 3

AI Analysis

This work addresses audio-visual localization and segmentation for multimodal AI applications, representing an incremental improvement over existing methods.

The paper tackles the challenge of applying CLIP models to audio-visual localization by proposing SOUPLE, which uses learnable context tokens to bridge audio and visual semantics and improves performance on datasets such as VGGSound and AVSBench.

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
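The idea of replacing the fixed prompt "a photo of a [V_A]" with learnable context tokens conditioned on visual features can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the two-layer conditioning network, and the names `conditional_context` and `build_prompt` are all hypothetical, and the conditioning follows a generic "shift every context token by a vector derived from the image feature" scheme assumed here for illustration. In a real system the weights would be trained and the resulting prompt would be passed to a text encoder and mask decoder.

```python
import random

random.seed(0)

# All sizes are hypothetical and kept tiny for readability.
EMBED_DIM = 8     # width of each prompt token embedding
NUM_CONTEXT = 4   # number of learnable context tokens
VISUAL_DIM = 16   # width of the pooled visual feature
HIDDEN_DIM = 4    # hidden width of the conditioning network


def random_matrix(rows, cols, scale):
    """Return a rows x cols list-of-lists with Gaussian entries."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(vector, matrix):
    """Compute vector @ matrix for a list vector and a list-of-rows matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


# Learnable context tokens that stand in for the fixed words "a photo of a".
context_tokens = random_matrix(NUM_CONTEXT, EMBED_DIM, 0.02)

# Weights of a small two-layer network (assumed, not taken from the paper).
w1 = random_matrix(VISUAL_DIM, HIDDEN_DIM, 0.1)
w2 = random_matrix(HIDDEN_DIM, EMBED_DIM, 0.1)


def conditional_context(visual_feature):
    """Shift every context token by a vector computed from the visual feature."""
    hidden = [max(h, 0.0) for h in matvec(visual_feature, w1)]  # ReLU
    shift = matvec(hidden, w2)
    return [[c + s for c, s in zip(token, shift)] for token in context_tokens]


def build_prompt(visual_feature, audio_token):
    """Append the audio-embedded token [V_A] to the conditioned context."""
    return conditional_context(visual_feature) + [list(audio_token)]


visual_feature = [random.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
audio_token = [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
prompt = build_prompt(visual_feature, audio_token)
print(len(prompt), len(prompt[0]))  # 5 8
```

The prompt has one row per context token plus a final row for the audio token, so the context can adapt to each image while the audio embedding itself is left unchanged.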
