AS · LG · SD · IV · ML
May 29, 2020

Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

arXiv:2006.01595v1 · 38 citations
Originality: Incremental advance
AI Analysis

This work addresses sound recognition for computational audio scene analysis. Its multi-modal approach is an incremental advance, but it delivers a clear, measurable gain on AudioSet.

The paper proposes an audiovisual fusion model that uses an attention mechanism to combine the audio and visual modalities. It achieves a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by 4.35 mAP (a 10.4% relative improvement).
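The relative figure can be checked from the two reported numbers. The sketch below assumes the prior state-of-the-art score is simply the reported mAP minus the reported gain; that baseline value is not stated in the summary, and the function name is illustrative.

```python
# Reported values from the summary above.
reported_map = 46.16   # mAP of the proposed fusion model on AudioSet
absolute_gain = 4.35   # reported gain over the prior state of the art


def relative_improvement(new_score, gain):
    """Return (implied baseline, relative gain in percent).

    Assumption: baseline = new_score - gain. The baseline itself is
    not given in the text, so it is derived here.
    """
    baseline = new_score - gain
    return baseline, 100.0 * gain / baseline


baseline, rel = relative_improvement(reported_map, absolute_gain)
print(f"implied baseline: {baseline:.2f} mAP, relative gain: {rel:.1f}%")
```

Under that assumption the implied baseline is about 41.81 mAP, which is consistent with the quoted 10.4% relative improvement.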

Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task in that it is easier to differentiate sounds using both the audio and visual modalities as opposed to one or the other. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model utilizes an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model, which outperforms the single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately 4.35 mAP (relative: 10.4%).
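The idea of dynamically combining the outputs of two single-modality models can be illustrated with a minimal sketch. This is not the paper's architecture: the linear scoring function, the parameter names (`w_audio`, `w_visual`, `bias`), and the per-class softmax over the two modalities are all assumptions chosen to keep the example small. It shows only the general pattern: input-dependent weights that sum to one, applied to the per-class predictions of each modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_probs, visual_probs, w_audio, w_visual, bias):
    """Fuse per-class predictions from two modalities.

    Hypothetical scoring (an assumption, not taken from the paper):
    each modality's attention score for class c is a linear function
    of that modality's own prediction. In a trained model the weights
    and biases would be learned.
    """
    fused = []
    for c in range(len(audio_probs)):
        scores = [
            w_audio[c] * audio_probs[c] + bias[c][0],
            w_visual[c] * visual_probs[c] + bias[c][1],
        ]
        alpha = softmax(scores)  # attention weights over {audio, visual}
        # Convex combination of the two modality predictions.
        fused.append(alpha[0] * audio_probs[c] + alpha[1] * visual_probs[c])
    return fused


# Two classes; with all-zero parameters the attention is uniform,
# so fusion reduces to a plain average of the two models.
audio = [0.9, 0.2]
visual = [0.3, 0.8]
zeros = [0.0, 0.0]
zero_bias = [[0.0, 0.0], [0.0, 0.0]]
uniform = attention_fuse(audio, visual, zeros, zeros, zero_bias)

# Large audio weights push the fused output toward the audio model.
audio_heavy = attention_fuse(audio, visual, [50.0, 50.0], zeros, zero_bias)
print(uniform, audio_heavy)
```

Because the weights come from a softmax, each fused score always lies between the audio and visual predictions for that class; the attention only decides which modality to trust more for a given input.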

