MM · Mar 23

Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

arXiv:2603.21948 · 48.1 · h-index: 2
Predicted impact: top 63% in MM · last 90 days
Originality: Highly original
AI Analysis

The work addresses the annotation bottleneck facing researchers and practitioners in audio-visual scene understanding, though it is incremental in that it builds on existing weakly supervised segmentation approaches.

The paper tackles the cost of per-frame annotation in audio-visual semantic segmentation by introducing a weakly supervised method that uses only video-level labels to generate per-frame semantic masks of sounding objects. It reports state-of-the-art performance among weakly supervised methods and remains competitive with fully supervised baselines.

Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
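
The idea of mapping an audio category to image regions without mask annotations can be illustrated with a generic similarity map. The sketch below is not the PCAS method from the abstract. It assumes that an audio embedding and per-region visual features already live in a shared space, and the names `region_scores`, `pseudo_mask`, `temperature`, and `keep_ratio` are hypothetical. It scores each region by temperature-scaled cosine similarity to the audio embedding, normalizes with a softmax, and thresholds the result into a coarse pseudo mask.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def region_scores(audio_emb, region_feats, temperature=0.1):
    """Softmax over regions of temperature-scaled cosine similarity.

    Assumption: audio and visual features share one embedding space.
    """
    sims = [cosine(audio_emb, r) / temperature for r in region_feats]
    shift = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_mask(scores, keep_ratio=0.5):
    """Keep regions scoring at least keep_ratio times the top score."""
    top = max(scores)
    return [1 if s >= keep_ratio * top else 0 for s in scores]


# Toy data: a 2-D audio embedding and four image-region features.
audio_emb = [1.0, 0.0]
region_feats = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

scores = region_scores(audio_emb, region_feats)
mask = pseudo_mask(scores)  # mask == [1, 0, 1, 0]
```

In this toy case the two regions whose features point in the same direction as the audio embedding are kept, while the orthogonal and opposite regions are dropped. A weakly supervised pipeline would learn the embeddings from video-level labels rather than fix them by hand as done here.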

