NE LG MM SD · Jul 11, 2016

Classifying Variable-Length Audio Files with All-Convolutional Networks and Masked Global Pooling

arXiv:1607.02857v1 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses audio classification tasks for applications like scene recognition and tagging, presenting an incremental improvement over existing methods.

The paper tackles acoustic scene classification and domestic audio tagging by training an all-convolutional neural network with masked global pooling. It achieves an average accuracy of 84.5% (baseline: 72.5%) and an average equal error rate of 0.17 (baseline: 0.21), which are relative improvements over the baselines of 17% and 19%, respectively.
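The relative figures can be checked directly from the reported numbers. The snippet below is a small sketch; `relative_improvement` and its `lower_is_better` flag are names introduced here for illustration, and the only inputs are the four values quoted above. For the equal error rate, lower is better, so the gain is measured as the reduction relative to the baseline.

```python
def relative_improvement(baseline: float, ours: float, lower_is_better: bool = False) -> float:
    """Relative change versus the baseline, signed so that positive means better."""
    gain = baseline - ours if lower_is_better else ours - baseline
    return gain / baseline


# Accuracy in percent: higher is better.
accuracy_gain = relative_improvement(72.5, 84.5)
# Equal error rate: lower is better.
eer_gain = relative_improvement(0.21, 0.17, lower_is_better=True)

print(f"accuracy: {accuracy_gain:.1%}, EER: {eer_gain:.1%}")
```

This gives roughly 16.6% and 19.0%, consistent with the rounded 17% and 19%.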

We trained a deep all-convolutional neural network with masked global pooling to perform single-label classification for acoustic scene classification and multi-label classification for domestic audio tagging in the DCASE-2016 contest. Our network achieved an average accuracy of 84.5% in four-fold cross-validation for acoustic scene recognition, compared to the provided baseline of 72.5%, and an average equal error rate of 0.17 for domestic audio tagging, compared to the baseline of 0.21. The network therefore improves on the baselines by a relative amount of 17% and 19%, respectively. The network consists only of convolutional layers, which extract features from the short-time Fourier transform, and one global pooling layer, which combines those features. Notably, it has no dropout layers and no fully-connected layers other than the output layer.
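To make the idea of masked global pooling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that clips of different lengths are zero-padded to a common length within a batch, that the mask is given as a per-clip count of valid time steps, and that the pooling operation is a mean; the abstract confirms none of these details. The function name and the `lengths` argument are hypothetical.

```python
from typing import List, Sequence


def masked_global_mean_pool(
    features: Sequence[Sequence[Sequence[float]]],
    lengths: Sequence[int],
) -> List[List[float]]:
    """Average each channel over the valid (non-padded) time steps only.

    features: batch of padded feature maps, indexed [clip][time][channel].
    lengths:  number of valid time steps per clip (assumed mask format).
    """
    pooled = []
    for clip, valid in zip(features, lengths):
        if not 0 < valid <= len(clip):
            raise ValueError("each length must be between 1 and the padded length")
        n_channels = len(clip[0])
        sums = [0.0] * n_channels
        for frame in clip[:valid]:  # frames beyond `valid` are padding and are skipped
            for c, value in enumerate(frame):
                sums[c] += value
        pooled.append([s / valid for s in sums])
    return pooled


# Two clips padded to 4 time steps with 2 channels each.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]],  # 2 valid steps, 2 padded
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [6.0, 2.0]],  # all 4 steps valid
]
result = masked_global_mean_pool(batch, lengths=[2, 4])
print(result)  # [[2.0, 3.0], [3.0, 2.0]]
```

Because the padded frames are excluded from both the sum and the divisor, each clip yields a fixed-size feature vector that does not depend on how much padding was added, which is what lets a single output layer handle variable-length input.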

Code Implementations: 1 repo
