SD · AI · CV · LG · AS · Mar 19, 2023

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

arXiv:2303.10757v1 · 31 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This work addresses efficient audio classification for applications like sound recognition, but it is incremental as it builds on existing Transformer-based methods.

The authors develop a multiscale audio spectrogram Transformer (MAST) that uses hierarchical representation learning. Without external training data, MAST improves top-1 accuracy over AST (e.g., by 22.2% on Kinetics-Sounds) while being 5x more efficient in terms of multiply-accumulates (MACs) and using 42% fewer parameters.

Audio events have a hierarchical structure in both time and frequency and can be grouped together to form more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, progressively reducing the number of tokens and increasing the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound, respectively, in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, in which over 20% of the audio clips are missing, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs), with a 42% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.
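
The stage-wise pooling described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the grid size, strides, channel multipliers, and the choice of average pooling are all assumptions made for illustration, and the function names are hypothetical. It only shows how pooling along time (1D) or along time and frequency (2D) shrinks the token count while a per-stage multiplier widens the features.

```python
# Illustrative sketch only: stage layout, strides, and widths are assumptions,
# not values taken from the paper.

def avg_pool_tokens(grid, stride_t, stride_f):
    """Average-pool a token grid indexed as grid[t][f][c].

    Non-overlapping windows of size stride_t x stride_f are averaged, so the
    token count shrinks by a factor of stride_t * stride_f. The channel count
    is unchanged here; a real model would widen features with a learned
    projection, which this sketch does not include.
    """
    n_t, n_f, n_c = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for t0 in range(0, n_t - stride_t + 1, stride_t):
        row = []
        for f0 in range(0, n_f - stride_f + 1, stride_f):
            window = [grid[t][f]
                      for t in range(t0, t0 + stride_t)
                      for f in range(f0, f0 + stride_f)]
            row.append([sum(tok[c] for tok in window) / len(window)
                        for c in range(n_c)])
        out.append(row)
    return out


def stage_schedule(n_t, n_f, dim, stages):
    """Return (time_tokens, freq_tokens, feature_dim) before and after each stage.

    `stages` is a list of (stride_t, stride_f, dim_multiplier) tuples.
    """
    shapes = [(n_t, n_f, dim)]
    for stride_t, stride_f, mult in stages:
        n_t, n_f, dim = n_t // stride_t, n_f // stride_f, dim * mult
        shapes.append((n_t, n_f, dim))
    return shapes


# Hypothetical configuration: pool along time only in the first stage (1D),
# then along both time and frequency (2D) in the next two stages.
HYPOTHETICAL_STAGES = [(2, 1, 2), (2, 2, 2), (2, 2, 2)]

if __name__ == "__main__":
    for n_t, n_f, dim in stage_schedule(64, 8, 96, HYPOTHETICAL_STAGES):
        print(f"tokens={n_t * n_f:4d} (time={n_t}, freq={n_f}), dim={dim}")
```

Because self-attention cost grows with the square of the token count, cutting tokens early is one plausible reason a hierarchical design can need fewer MACs than a single-scale Transformer, even as later stages use wider features.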
