Acoustic Scene Classification with Spectrogram Processing Strategies
This work addresses acoustic scene classification for audio analysis applications. It is incremental: it builds on existing CNN methods and adds new spectrogram processing strategies.
The paper tackles acoustic scene classification by developing spectrogram processing strategies that make efficient use of multiple or single spectrogram representations. It reports accuracies of 81.8% and 92.1% on DCASE 2020 Task1A and Task1B, against official baselines of 54.1% and 87.3%, respectively.
Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance in the acoustic scene classification (ASC) task. The audio data is often transformed into two-dimensional spectrogram representations, which are then fed to the neural networks. In this paper, we study the problem of efficiently taking advantage of different spectrogram representations through discriminative processing strategies. There are two main contributions. The first is an exploration of how combining multiple spectrogram representations at different stages affects performance, which provides a meaningful reference for effective spectrogram fusion. The second is a set of processing strategies over multiple frequency bands and multiple temporal frames, proposed to make full use of a single spectrogram representation. The proposed spectrogram processing strategies can be easily transferred to any network structure. The experiments are carried out on the DCASE 2020 Task1 datasets, and the results show that our method achieves accuracies of 81.8% (official baseline: 54.1%) and 92.1% (official baseline: 87.3%) on the officially provided fold 1 evaluation datasets of Task1A and Task1B, respectively.
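The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The abstract does not say how bands or frames are cut, which representations are combined, or how fusion is performed. The equal-width contiguous splits, channel stacking as an "early" fusion, and probability averaging as a "late" fusion are therefore assumptions chosen for illustration, as are all function names, array shapes, and the random stand-in data.

```python
import numpy as np


def split_bands(spec, n_bands):
    """Split a (freq, time) spectrogram into contiguous frequency bands.

    Assumption: near-equal-width, non-overlapping bands.
    """
    return np.array_split(spec, n_bands, axis=0)


def split_frames(spec, n_frames):
    """Split a (freq, time) spectrogram into contiguous temporal frames.

    Assumption: near-equal-length, non-overlapping frames.
    """
    return np.array_split(spec, n_frames, axis=1)


def early_fusion(specs):
    """Stack same-shaped spectrograms as channels: (channels, freq, time)."""
    return np.stack(specs, axis=0)


def late_fusion(class_probs):
    """Average class probabilities predicted from each representation."""
    return np.mean(np.stack(class_probs, axis=0), axis=0)


# Random stand-ins for two spectrogram representations of one clip.
rng = np.random.default_rng(0)
spec_a = rng.standard_normal((128, 431))
spec_b = rng.standard_normal((128, 431))

bands = split_bands(spec_a, 4)      # four sub-spectrograms along frequency
frames = split_frames(spec_a, 3)    # three sub-spectrograms along time
fused_input = early_fusion([spec_a, spec_b])

# Hypothetical per-representation outputs of some classifier.
fused_probs = late_fusion([
    np.array([0.7, 0.2, 0.1]),
    np.array([0.5, 0.4, 0.1]),
])
```

In such a setup, each band or frame could be passed to its own network branch, and the fused input or fused probabilities correspond to combining representations at the input or at the decision stage.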