SD AS

Dec 27, 2017

Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Shao-Yen Tseng, Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, Samarjit Das

arXiv:1712.09673v27.45 citations

h-index: 43

Originality: Incremental advance

AI Analysis

This work addresses two limits on audio event detection: scaling to large datasets that lack fine-grained labels, and running in applications with limited computational resources. The advance is incremental, since it builds on existing multiple instance learning and embedding techniques.

The paper tackles audio event detection when fine-grained labels are too expensive to obtain. It proposes a small-footprint multiple instance learning framework trained on weakly annotated data. Combined with late fusion, the framework improves the F1 score of a baseline audio tagging system by 17% on a subset of AudioSet.

State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine-resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as input features. We show that by using audio embeddings the MIL framework can be implemented using a simple DNN with performance comparable to recurrent neural networks. We evaluate our approach by training an audio tagging system using a subset of AudioSet, which is a large collection of weakly labeled YouTube video excerpts. Combined with a late-fusion approach, we improve the F1 score of a baseline audio tagging system by 17%. We show that audio embeddings extracted by the convolutional neural networks significantly boost the performance of all MIL models. This framework reduces the model complexity of the AED system and is suitable for applications where computational resources are limited.
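The MIL idea in the abstract can be illustrated with a minimal sketch. A clip is treated as a bag of per-segment embeddings, a small DNN scores each segment, and the scores are pooled into one clip-level prediction that is compared with the weak (clip-level) labels. Everything specific here is an assumption made for illustration and is not taken from the paper: the max-pooling rule, the layer sizes, the 128-dimensional embeddings, the random weights, and all function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_scores(bag, w1, b1, w2, b2):
    """Per-segment class probabilities from a one-hidden-layer DNN."""
    hidden = np.maximum(0.0, bag @ w1 + b1)  # ReLU, shape (T, H)
    return sigmoid(hidden @ w2 + b2)         # shape (T, C)

def bag_prediction(inst_probs):
    """Pool segment scores into a clip-level score.
    Max pooling is an assumed choice: a class counts as present
    in the clip if at least one segment scores high for it."""
    return inst_probs.max(axis=0)            # shape (C,)

def weak_label_loss(bag_probs, weak_labels, eps=1e-7):
    """Binary cross-entropy between clip-level scores and weak labels."""
    p = np.clip(bag_probs, eps, 1.0 - eps)
    return float(-np.mean(weak_labels * np.log(p)
                          + (1.0 - weak_labels) * np.log(1.0 - p)))

# Hypothetical sizes: T segments, D-dim embeddings, H hidden units, C classes.
rng = np.random.default_rng(0)
T, D, H, C = 10, 128, 32, 5
bag = rng.normal(size=(T, D))                # stand-in for CNN embeddings
w1 = rng.normal(scale=0.1, size=(D, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=(H, C))
b2 = np.zeros(C)
weak_labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])  # clip-level tags only

inst = instance_scores(bag, w1, b1, w2, b2)
clip = bag_prediction(inst)
loss = weak_label_loss(clip, weak_labels)
print(inst.shape, clip.shape, loss)
```

Because the loss is computed only on the pooled clip-level score, no segment-level annotation is needed during training, which is the property that lets weakly labeled data such as AudioSet be used.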
