Multiscale Audio Spectrogram Transformer for Efficient Audio Classification
This work addresses efficient audio classification for applications such as sound recognition; its contribution is incremental, as it builds on existing Transformer-based methods.
The authors develop a multiscale audio spectrogram Transformer (MAST) that uses hierarchical representation learning. Without external training data, it improves top-1 accuracy over AST (e.g., by 22.2% on Kinetics-Sounds) while being 5x more efficient in terms of multiply-accumulates (MACs) and using 42% fewer parameters.
Audio events have a hierarchical structure in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, progressively reducing the number of tokens and increasing the feature dimensions. MAST significantly outperforms AST~\cite{gong2021ast} by 22.2\%, 4.4\% and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, in which over 20\% of the audio clips are missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs) and has 42\% fewer parameters than AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.
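To make the stage-wise pooling idea concrete, the following is a minimal sketch of how a token grid might shrink while the feature dimension grows. It is not the authors' implementation. The initial grid size (64 time by 8 frequency tokens), the base dimension (96), the per-stage strides, the dimension multipliers, and the use of non-overlapping average pooling are all assumptions chosen for illustration; the names `Stage`, `token_schedule`, and `pool_tokens` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One hypothetical stage of a multiscale Transformer."""
    time_stride: int  # pooling stride along the time axis
    freq_stride: int  # pooling stride along frequency (1 means time-only, i.e. 1-D pooling)
    dim_mult: int     # factor by which the feature dimension grows


def token_schedule(time_tokens, freq_tokens, dim, stages):
    """Return (time_tokens, freq_tokens, dim) before the first stage and after each stage."""
    shapes = [(time_tokens, freq_tokens, dim)]
    for s in stages:
        time_tokens = -(-time_tokens // s.time_stride)  # ceiling division
        freq_tokens = -(-freq_tokens // s.freq_stride)
        dim *= s.dim_mult
        shapes.append((time_tokens, freq_tokens, dim))
    return shapes


def pool_tokens(grid, time_stride, freq_stride):
    """Average-pool a [time][freq] grid of scalar token values.

    Windows do not overlap; average pooling is an assumption made here,
    since the abstract does not say which pooling operator is used.
    """
    n_time, n_freq = len(grid), len(grid[0])
    pooled = []
    for t in range(0, n_time, time_stride):
        row = []
        for f in range(0, n_freq, freq_stride):
            window = [
                grid[i][j]
                for i in range(t, min(t + time_stride, n_time))
                for j in range(f, min(f + freq_stride, n_freq))
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Assumed configuration: one time-only (1-D) stage, then two time-frequency (2-D) stages.
STAGES = [Stage(2, 1, 2), Stage(2, 2, 2), Stage(2, 2, 2)]

if __name__ == "__main__":
    for t, f, d in token_schedule(64, 8, 96, STAGES):
        print(f"tokens={t * f:4d} (time={t}, freq={f}), dim={d}")
```

Under these assumed values, the token count falls from 512 to 16 across the three stages while the feature dimension grows from 96 to 768. Because self-attention cost grows with the square of the token count, reducing tokens in later stages is one plausible source of the MAC savings the abstract reports.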