AS · LG · IV · Dec 2, 2024

TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

arXiv:2412.01488v3 · h-index: 30
Originality: Incremental advance
AI Analysis

This paper addresses the problem of segmenting image regions based on audio prompts, with applications in multimedia analysis. It is an incremental advance because it builds on existing pre-trained models and NMF techniques.

The paper tackles sound-prompted segmentation with a training-free method that uses Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features. It reports state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.

Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
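To make the co-factorization idea concrete, here is a minimal sketch in Python with NumPy. It is not the paper's formulation. It assumes the audio and visual features are non-negative and already live in a shared D-dimensional space, it uses plain Lee-Seung multiplicative updates on the stacked matrix, and it omits both the semantic constraint and the hand-off to an open-vocabulary segmentation model. The function names `co_nmf` and `sound_prompted_heatmap` are hypothetical.

```python
import numpy as np


def co_nmf(audio_feats, visual_feats, n_concepts=4, n_iter=300, eps=1e-9, seed=0):
    """Factorize two non-negative matrices with one shared concept dictionary.

    audio_feats:  (T, D) non-negative features, one row per audio frame.
    visual_feats: (N, D) non-negative features, one row per image patch.
    Returns (U_a, U_v, H) such that audio_feats ~= U_a @ H and
    visual_feats ~= U_v @ H, where H has shape (n_concepts, D).
    """
    rng = np.random.default_rng(seed)
    # Stacking the rows forces both modalities to share the same dictionary H.
    X = np.vstack([audio_feats, visual_feats]).astype(float)
    U = rng.random((X.shape[0], n_concepts)) + eps
    H = rng.random((n_concepts, X.shape[1])) + eps

    # Multiplicative updates for the Frobenius-norm NMF objective.
    for _ in range(n_iter):
        H *= (U.T @ X) / (U.T @ U @ H + eps)
        U *= (X @ H.T) / (U @ (H @ H.T) + eps)

    # Normalize each concept to unit L1 norm so activations are comparable.
    scale = H.sum(axis=1, keepdims=True) + eps
    H /= scale
    U *= scale.T

    n_audio = audio_feats.shape[0]
    return U[:n_audio], U[n_audio:], H


def sound_prompted_heatmap(U_a, U_v, grid_shape):
    """Pick the concept most active in the audio and map it onto the patch grid."""
    k = int(np.argmax(U_a.sum(axis=0)))
    heat = U_v[:, k].reshape(grid_shape)
    return k, heat / (heat.max() + 1e-9)


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): two concepts with
    # disjoint feature support; the audio contains only the first one.
    rng = np.random.default_rng(1)
    c0 = np.concatenate([rng.random(8) + 0.1, np.zeros(8)])
    c1 = np.concatenate([np.zeros(8), rng.random(8) + 0.1])
    audio = np.outer(rng.random(10) + 0.5, c0)
    visual = np.vstack([np.outer(rng.random(8) + 0.5, c0),
                        np.outer(rng.random(8) + 0.5, c1)])
    U_a, U_v, H = co_nmf(audio, visual, n_concepts=2)
    k, heat = sound_prompted_heatmap(U_a, U_v, (4, 4))
    print("audio-dominant concept:", k)
    print(np.round(heat, 2))
```

In this toy setup the heatmap should concentrate on the patches that share the audio's concept. In the method described above, such shared concepts would instead be passed to an open-vocabulary segmentation model to obtain the final segmentation maps.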

