CV SD AS | Apr 2, 2024

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Tanvir Mahmud, Yapeng Tian, Diana Marculescu

arXiv:2404.01751v2 | 18.224 citations | h-index: 10 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

This work addresses multi-source sound localization in videos, with applications such as robotics and surveillance. It is an incremental advance that adds the text modality to improve on existing weakly supervised methods.

The paper tackles the challenge of accurately localizing visual sound sources in multi-source video mixtures. It introduces a text-guided framework that uses tri-modal embeddings to disentangle semantic audio-visual correspondences, and it reports significant improvements over state-of-the-art methods on the MUSIC, VGGSound, and VGGSound-Instruments datasets.

Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main
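The two stages described in the abstract (predict which classes are sounding, then use each class's text representation to guide localization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the fixed threshold, and the toy one-hot embeddings are all assumptions made for illustration. A real system would obtain the audio, text, and visual embeddings from a tri-modal model such as AudioCLIP.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def detect_classes(audio_emb, text_embs, threshold=0.5):
    """Stage 1 (hypothetical): classes whose text embedding is close to the
    embedding of the audio mixture are treated as sounding."""
    sims = l2_normalize(text_embs) @ l2_normalize(audio_emb)  # shape (C,)
    return [int(i) for i in np.flatnonzero(sims > threshold)]

def localize(visual_feats, text_embs, class_ids):
    """Stage 2 (hypothetical): one H x W cosine-similarity map per detected
    class, comparing each visual patch with that class's text embedding."""
    v = l2_normalize(visual_feats)                  # shape (H, W, D)
    return {c: v @ l2_normalize(text_embs[c]) for c in class_ids}

# Toy stand-ins for tri-modal embeddings (assumed, not real model outputs).
D = 8
basis = np.eye(D)
text_embs = basis[:3]                    # three classes: e0, e1, e2
audio_emb = text_embs[0] + text_embs[2]  # mixture of class 0 and class 2

visual_feats = np.tile(basis[7], (4, 4, 1))  # background patches: e7
visual_feats[0, 0] = text_embs[0]            # class 0 appears at patch (0, 0)
visual_feats[3, 3] = text_embs[2]            # class 2 appears at patch (3, 3)

detected = detect_classes(audio_emb, text_embs)
maps = localize(visual_feats, text_embs, detected)
peaks = {c: np.unravel_index(m.argmax(), m.shape) for c, m in maps.items()}
print(detected, peaks)
```

Because the maps are computed per detected class, the sketch handles a variable number of sources, and a new class only needs a text embedding. This mirrors, in a very simplified form, the flexibility and zero-shot transfer the abstract describes.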

