CV · Mar 24

SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

arXiv:2603.22732 · 14.3 · h-index: 3
AI Analysis

This work addresses audio-visual localization and segmentation for multimodal AI applications, representing an incremental improvement over existing methods.

The paper tackles the challenge of applying CLIP models to audio-visual localization by proposing SOUPLE, which uses learnable context tokens to bridge audio and visual semantics and reports improved localization and segmentation performance on VGGSound, SoundNet, and AVSBench.

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
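
The contrast between the fixed prompt and learnable, visually conditioned context tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the single linear projection `w_meta`, the `tanh` nonlinearity, and the additive conditioning are all assumptions chosen for illustration, and the random arrays stand in for features that would come from real audio and image encoders. The mask decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8      # token embedding width (hypothetical; real CLIP widths are larger)
NUM_CONTEXT = 4    # number of learnable context tokens (hypothetical)


def fixed_prompt(word_embeddings, audio_token):
    """Baseline: frozen embeddings for "a photo of a" followed by [V_A]."""
    return np.concatenate([word_embeddings, audio_token[None, :]], axis=0)


def conditional_prompt(context, visual_feature, w_meta, audio_token):
    """Learnable context shifted by a visually conditioned offset, then [V_A].

    context:        (NUM_CONTEXT, EMBED_DIM) trainable parameters
    visual_feature: (EMBED_DIM,) pooled image feature
    w_meta:         (EMBED_DIM, EMBED_DIM) trainable projection (assumed linear)
    audio_token:    (EMBED_DIM,) audio embedding mapped into token space
    """
    offset = np.tanh(visual_feature @ w_meta)      # (EMBED_DIM,)
    conditioned = context + offset[None, :]        # same offset added to every token
    return np.concatenate([conditioned, audio_token[None, :]], axis=0)


# Stand-in tensors; in practice these would come from the encoders.
word_embeddings = rng.normal(size=(4, EMBED_DIM))  # "a", "photo", "of", "a"
context = rng.normal(size=(NUM_CONTEXT, EMBED_DIM))
w_meta = rng.normal(size=(EMBED_DIM, EMBED_DIM)) * 0.1
audio_token = rng.normal(size=EMBED_DIM)
image_a = rng.normal(size=EMBED_DIM)
image_b = rng.normal(size=EMBED_DIM)

baseline = fixed_prompt(word_embeddings, audio_token)
prompt_a = conditional_prompt(context, image_a, w_meta, audio_token)
prompt_b = conditional_prompt(context, image_b, w_meta, audio_token)
```

In this sketch the baseline prompt is identical for every image, whereas `prompt_a` and `prompt_b` share the same audio token but carry different context tokens because each is conditioned on a different image feature.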
