SD · LG · ML · Sep 29, 2016

CNN Architectures for Large-Scale Audio Classification

arXiv:1609.09430v2
2944 citations
AI Analysis

This work addresses audio classification for multimedia applications, but it is incremental as it adapts existing image-based CNNs to audio data.

The paper tackles large-scale audio classification by applying several CNN architectures to the soundtracks of 70 million training videos. It finds that analogs of image-classification CNNs such as AlexNet and ResNet perform well, and that a model using embeddings from these classifiers does much better than raw features on the Audio Set Acoustic Event Detection (AED) task.

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
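To give a sense of the scale quoted in the abstract, the short sketch below derives the average soundtrack length per video and the total listening time in years. It uses only the figures stated above (70M videos, 5.24 million hours). The constant and function names are illustrative and do not come from the paper.

```python
# Figures taken from the abstract; names are illustrative, not from the paper.
NUM_VIDEOS = 70_000_000   # 70M training videos
TOTAL_HOURS = 5_240_000   # 5.24 million hours of soundtrack


def average_minutes_per_video(total_hours: float, num_videos: int) -> float:
    """Mean soundtrack length in minutes, assuming hours are spread over all videos."""
    return total_hours * 60 / num_videos


def hours_to_years(total_hours: float) -> float:
    """Convert hours to years of continuous playback (365-day years)."""
    return total_hours / (24 * 365)


avg_minutes = average_minutes_per_video(TOTAL_HOURS, NUM_VIDEOS)
total_years = hours_to_years(TOTAL_HOURS)

print(f"Average soundtrack length: {avg_minutes:.2f} minutes")
print(f"Total audio: {total_years:.0f} years of continuous playback")
```

On these numbers the average video is roughly four and a half minutes long, and the full training set corresponds to about six centuries of continuous audio.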

Code Implementations: 16 repos

Data from Papers with Code (CC-BY-SA-4.0)

