SDASJul 12, 2020

Learning Frame Level Attention for Environmental Sound Classification

arXiv:2007.07241v1
Originality Incremental advance
AI Analysis

This work improves classification for environmental sound analysis, but it is incremental as it builds on existing attention methods.

The paper tackled environmental sound classification by addressing irrelevant and silent frames using a frame-level attention mechanism, achieving state-of-the-art or competitive accuracy on ESC-50 and ESC-10 datasets with lower computational complexity.

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from the semantically irrelevant frames and silent frames. In order to deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. We investigated the classification performance when using different attention scaling function and applying different layers. Experiments were conducted on ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method and our method achieved the state-of-the-art or competitive classification accuracy with lower computational complexity. We also visualized our attention results and observed that the proposed attention mechanism was able to lead the network tofocus on the semantically relevant parts of environmental sounds.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes