CV · MM · Apr 6, 2023

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation

arXiv:2304.02970v737 citations · h-index: 61
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of accurate cross-modal alignment for audio-visual segmentation. Its contribution is incremental: it builds on existing methods, adding new data and training strategies.

The paper tackles the problem of biased datasets and poor generalization in audio-visual segmentation by proposing a new cost-effective benchmark and a sample mining method for contrastive learning, achieving state-of-the-art segmentation accuracy.

Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files, and 2) a model that can establish strong links between audio information and its corresponding visual object. However, these requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore, experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
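
The idea of audio-visual supervised contrastive learning with mined samples can be illustrated with a minimal NumPy sketch. The abstract does not specify the paper's mining rule, so a generic top-k hard-negative selection is used here as a stand-in; the function name `mined_supcon_loss`, its parameters, and the toy embeddings are all hypothetical and not taken from the authors' code.

```python
import numpy as np

def mined_supcon_loss(audio_emb, visual_emb, audio_labels, visual_labels,
                      temperature=0.1, top_k=None):
    """Supervised contrastive loss with audio anchors and visual samples.

    Assumption: top_k hard-negative mining is a generic stand-in,
    not the paper's informative sample mining method.
    """
    audio_labels = np.asarray(audio_labels)
    visual_labels = np.asarray(visual_labels)
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sim = a @ v.T / temperature

    # Positives share a class label with the audio anchor.
    pos = audio_labels[:, None] == visual_labels[None, :]
    neg = ~pos
    keep = np.ones_like(pos)
    if top_k is not None:
        # Keep only the top_k most similar (hardest) negatives per anchor.
        neg_sim = np.where(neg, sim, -np.inf)
        order = np.argsort(-neg_sim, axis=1)[:, :top_k]
        hard = np.zeros_like(pos)
        np.put_along_axis(hard, order, True, axis=1)
        keep = pos | (hard & neg)

    # Only anchors with at least one positive contribute to the loss.
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0
    sim, pos, keep, n_pos = sim[valid], pos[valid], keep[valid], n_pos[valid]

    # Numerically stable log-sum-exp over the kept samples.
    masked = np.where(keep, sim, -np.inf)
    m = masked.max(axis=1, keepdims=True)
    log_denom = m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / n_pos
    return float(per_anchor.mean())

# Toy usage: two audio clips and two visual regions, one per class.
emb = np.eye(2)
aligned = mined_supcon_loss(emb, emb, [0, 1], [0, 1])
misaligned = mined_supcon_loss(emb, emb, [0, 1], [1, 0])
```

In this toy case the loss is small when each audio embedding points at the visual embedding of its own class and large when the labels are swapped. Restricting the denominator to the hardest negatives concentrates the training signal on the samples most easily confused with the anchor.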

Code Implementations: 1 repo