SD, LG, AS, ML | Apr 6, 2019

Spatio-Temporal Attention Pooling for Audio Scene Classification

arXiv:1904.03543v2 | 27 citations
Originality: Incremental advance
AI Analysis

This work addresses audio scene classification for applications such as smart devices. It is an incremental advance because it builds on existing neural network architectures.

The authors tackled acoustic scene classification by introducing a spatio-temporal attention pooling layer with a convolutional recurrent neural network to focus on discriminative patterns, achieving new state-of-the-art performance on the LITIS Rouen dataset.

Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset.
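The pooling step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says that the two attention vectors come from "two designated attention layers", so the linear scoring with softmax used here, the hypothetical weight vectors `w_time` and `w_space`, and the final sum over time are all assumptions. The `between_class_mix` helper likewise shows only a simple linear mixing variant of between-class augmentation; the mixing rule used in the paper is not given in this text.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def spatio_temporal_attention_pool(O, w_time, w_space):
    """Pool recurrent outputs O of shape (T, F) into one F-dim vector.

    w_time (F,) and w_space (T,) are hypothetical attention-layer
    weights; the scoring functions are assumptions for illustration.
    """
    a_time = softmax(O @ w_time)        # temporal attention, shape (T,)
    a_space = softmax(O.T @ w_space)    # spatial attention, shape (F,)
    mask = np.outer(a_time, a_space)    # 2-D attention mask, shape (T, F)
    pooled = (mask * O).sum(axis=0)     # assumed pooling: sum over time
    return pooled, mask


def between_class_mix(x1, y1, x2, y2, r):
    """Assumed simple variant: linearly mix two examples and their
    one-hot labels with ratio r in [0, 1]."""
    return r * x1 + (1.0 - r) * x2, r * y1 + (1.0 - r) * y2


rng = np.random.default_rng(0)
T, F = 20, 8                            # time steps, feature dimension
O = rng.standard_normal((T, F))         # stand-in for recurrent output
pooled, mask = spatio_temporal_attention_pool(
    O, rng.standard_normal(F), rng.standard_normal(T)
)

# Mix two inputs from different classes (3-class one-hot labels).
x_mix, y_mix = between_class_mix(
    rng.standard_normal((T, F)), np.array([1.0, 0.0, 0.0]),
    rng.standard_normal((T, F)), np.array([0.0, 1.0, 0.0]),
    r=0.3,
)
```

Because each attention vector is a softmax output, the entries of the outer-product mask are non-negative and sum to one, so the mask acts as a joint weighting over time steps and feature dimensions.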

