SD · LG · AS · Nov 26, 2018

Combining High-Level Features of Raw Audio Waves and Mel-Spectrograms for Audio Tagging

arXiv:1811.10708v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses audio tagging, with applications such as sound classification. The advance is incremental: it builds on existing ensemble and feature-combination techniques.

The paper proposes a single-model method for audio tagging that combines high-level features learned from log-scaled mel-spectrograms and raw audio waves. Two CNNs, one per input type, learn the features, and densely connected layers combine them within a single network. The method ranked in the top 2% of the Freesound General-Purpose Audio Tagging Challenge.

In this paper, we describe our contribution to Task 2 of the DCASE 2018 Audio Challenge. While it has become ubiquitous to utilize an ensemble of machine learning methods for classification tasks to obtain better predictive performance, the majority of ensemble methods combine predictions rather than learned features. We propose a single-model method that combines learned high-level features computed from log-scaled mel-spectrograms and raw audio data. These features are learned separately by two Convolutional Neural Networks, one for each input type, and then combined by densely connected layers within a single network. This relatively simple approach along with data augmentation ranks among the best two percent in the Freesound General-Purpose Audio Tagging Challenge on Kaggle.
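
The feature-level combination described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the layer sizes, kernel shapes, random weights, and all function names (`conv1d_branch`, `conv2d_branch`, `fused_forward`, `init_params`) are assumptions made for illustration, and each "CNN" is reduced to a single convolution with ReLU and global max pooling. The default of 41 output classes is likewise an assumption about the challenge's label set. The sketch shows only the structural idea: each branch produces a learned feature vector, the vectors are concatenated, and dense layers map the fused features to class probabilities.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def conv1d_branch(wave, kernels):
    """Toy stand-in for the raw-wave CNN (hypothetical).

    wave: (batch, samples), kernels: (n_filters, width)
    returns: (batch, n_filters) after valid conv, ReLU, global max pool.
    """
    windows = sliding_window_view(wave, kernels.shape[1], axis=1)
    return relu(windows @ kernels.T).max(axis=1)


def conv2d_branch(mel, kernels):
    """Toy stand-in for the mel-spectrogram CNN (hypothetical).

    mel: (batch, n_mels, frames), kernels: (n_filters, kh, kw)
    returns: (batch, n_filters) after valid conv, ReLU, global max pool.
    """
    windows = sliding_window_view(mel, kernels.shape[1:], axis=(1, 2))
    maps = np.einsum("bijhw,fhw->bijf", windows, kernels)
    return relu(maps).max(axis=(1, 2))


def init_params(rng, n_mel_filters=8, n_wave_filters=8, hidden=16, n_classes=41):
    # All sizes are illustrative assumptions, not values from the paper.
    return {
        "mel_kernels": rng.normal(0.0, 0.1, (n_mel_filters, 3, 3)),
        "wave_kernels": rng.normal(0.0, 0.1, (n_wave_filters, 9)),
        "W1": rng.normal(0.0, 0.1, (n_mel_filters + n_wave_filters, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, n_classes)),
        "b2": np.zeros(n_classes),
    }


def fused_forward(mel, wave, params):
    # Each branch computes its own high-level features.
    f_mel = conv2d_branch(mel, params["mel_kernels"])
    f_wave = conv1d_branch(wave, params["wave_kernels"])
    # Combine features (not predictions) inside one network.
    fused = np.concatenate([f_mel, f_wave], axis=1)
    hidden = relu(fused @ params["W1"] + params["b1"])
    return softmax(hidden @ params["W2"] + params["b2"])


rng = np.random.default_rng(0)
params = init_params(rng)
mel = rng.normal(size=(2, 16, 20))   # 2 clips, 16 mel bands, 20 frames
wave = rng.normal(size=(2, 400))     # 2 clips, 400 raw samples
probs = fused_forward(mel, wave, params)
print(probs.shape)
```

The contrast with a conventional ensemble is in `fused_forward`: an ensemble would run two complete classifiers and average their output probabilities, whereas here the concatenation happens before the dense layers, so the classifier can learn interactions between the two feature sets.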

