Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification
This work improves environmental sound classification, which is useful for applications such as surveillance and smart devices, but the contribution is incremental because it builds on existing attention mechanisms and neural network architectures.
The paper tackles environmental sound classification by addressing semantically irrelevant and silent frames with a frame-level attention mechanism integrated into a convolutional recurrent neural network, achieving state-of-the-art classification accuracy on the ESC-50 and ESC-10 datasets.
Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. ESC performance depends heavily on the effectiveness of the representative features extracted from environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model that focuses on semantically relevant and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. Experiments were conducted on the ESC-50 and ESC-10 datasets. The experimental results demonstrate the effectiveness of the proposed method, which achieves state-of-the-art classification accuracy.
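To make the idea of frame-level attention concrete, the following is a minimal sketch of attention pooling over per-frame features. It is not the paper's implementation: the abstract does not specify the scoring function, so a simple dot product with a scoring vector is assumed here, and the names `frame_attention_pool`, `frames`, and `score_weights` are hypothetical. In the described model, the per-frame features would come from the convolutional recurrent network, and the scoring parameters would be learned rather than fixed.

```python
import math


def frame_attention_pool(frames, score_weights):
    """Pool T frame feature vectors into one vector using softmax attention.

    frames: list of T feature vectors, each of length D (assumed to be the
            per-frame outputs of some upstream network).
    score_weights: length-D scoring vector (an assumed, simplified scorer).
    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per frame (dot product with the scorer).
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Attention-weighted sum of frame features.
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


# Toy input: a silent frame, a salient frame, and a near-silent frame.
frames = [[0.0, 0.0], [2.0, 1.0], [0.1, 0.0]]
score_weights = [1.0, 1.0]
pooled, alphas = frame_attention_pool(frames, score_weights)
```

In this toy input the second frame receives most of the attention weight, so the pooled vector is dominated by it, while the silent and near-silent frames contribute little. This is the behaviour the abstract attributes to frame-level attention, shown here under the simplifying assumptions stated above.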