CV · Jun 22, 2017

Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

arXiv:1706.07156v1 · 160 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of selecting optimal audio input representations for efficient CNN training in environmental sound classification, but it is incremental as it compares existing methods without introducing new ones.

The study compared various time-frequency representations for environmental sound classification using CNNs, finding that Mel-scaled STFT slightly outperformed other methods and significantly beat baseline MFCC features, with 2D convolution generally yielding better results than 1D.

Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Visual displays of an audio signal through time-frequency representations, such as spectrograms, offer a rich representation of the temporal and spectral structure of the original signal. In this letter, we compare several popular signal processing methods for obtaining this representation, namely the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform (CQT), and the continuous wavelet transform (CWT), and assess their impact on the classification performance of CNNs on two environmental sound datasets. This study supports the hypothesis that time-frequency representations are valuable in learning useful features for sound classification. Moreover, the choice of transformation is shown to affect classification accuracy: Mel-scaled STFT slightly outperforms the other methods discussed and outperforms baseline MFCC features by a large margin. Additionally, we observe that the optimal window size for the transformation depends on the characteristics of the audio signal and that, architecturally, 2D convolution yielded better results than 1D in most cases.
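To make the comparison in the abstract concrete, the following is a minimal sketch of two of the representations it names: a linear-scale STFT power spectrogram and a Mel-scaled version of it. It is not the paper's pipeline. The sample rate, window size, hop size, number of Mel bands, the Hann window, the HTK-style Mel formula, and all function names are assumptions chosen for illustration, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    # HTK-style Mel formula (one common convention; assumed here)
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def stft_power(signal, n_fft, hop):
    """Linear-scale power spectrogram: one list of n_fft//2 + 1 bins per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frame = []
        for k in range(n_fft // 2 + 1):
            # Naive DFT of the windowed segment (an FFT would be used in practice)
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            frame.append(abs(acc) ** 2)
        frames.append(frame)
    return frames


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sr / 2)
    hz_pts = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for j in range(1, n_mels + 1):
        lo, mid, hi = hz_pts[j - 1], hz_pts[j], hz_pts[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sr / n_fft  # centre frequency of STFT bin k in Hz
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(power, bank):
    """Project each linear-frequency frame onto the Mel bands."""
    return [[sum(w * p for w, p in zip(row, frame)) for row in bank]
            for frame in power]


# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz
sr, n_fft, hop, n_mels = 8000, 256, 128, 20
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(1024)]

linear = stft_power(tone, n_fft, hop)
mel = apply_filterbank(linear, mel_filterbank(n_mels, n_fft, sr))

print(len(linear), len(linear[0]))  # frames x linear-frequency bins
print(len(mel), len(mel[0]))        # frames x Mel bands
print("frequency resolution (Hz):", sr / n_fft)
print("window duration (s):", n_fft / sr)
```

The last two lines illustrate why window size matters: a longer window gives finer frequency resolution (`sr / n_fft`) at the cost of coarser time resolution (`n_fft / sr`), so the better setting depends on whether the sounds of interest are tonal or transient. Either array can be fed to a CNN as a 2D image-like input, or treated as a multi-channel 1D sequence along the time axis.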

