SD AI IR LG MM AS

Oct 7, 2021

Attention is All You Need? Good Embeddings with Statistics are enough: Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or ....

arXiv:2110.03183v54.33 citations

Originality: Incremental advance

AI Analysis

This work addresses the computational complexity of audio understanding by offering researchers and practitioners a simpler alternative that avoids convolutions, attention, and recurrence while remaining competitive with state-of-the-art methods.

The paper tackles large-scale audio understanding with a Bag-of-Words model built on clustered embeddings and an MLP head. Without convolutions, attention, or recurrence, the method surpasses convolutional neural networks and comes close to outperforming Transformer architectures.

This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional architectures have achieved state-of-the-art results, surpassing traditional hand-crafted features. In the recent past, there has been a similar shift away from traditional convolutional and recurrent neural networks towards purely end-to-end Transformer architectures. In this work, we explore an approach based on a Bag-of-Words model. Our approach does not use any convolutions, recurrence, attention, transformers, or other approaches such as BERT. We utilize micro- and macro-level clustered vanilla embeddings, and use an MLP head for classification. We only use feed-forward encoder-decoder models to get the bottlenecks of spectral envelopes, spectral patches and slices, as well as multi-resolution spectra. A classification head (a feed-forward layer), similar to the approach in SimCLR, is trained on the learned representation. Using simple codes learned on latent representations, we show how we surpass traditional convolutional neural network architectures and come strikingly close to outperforming powerful Transformer architectures. We hope this work paves the way for exciting advancements in the field of representation learning without massive, end-to-end neural architectures.
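The Bag-of-Words step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the bottleneck embeddings have already been computed by some encoder, uses plain k-means as the clustering step (the abstract says only that embeddings are "clustered"), and works on toy 2-D vectors. The function names `nearest_code`, `kmeans`, and `bag_of_words` are hypothetical, introduced here for illustration.

```python
from collections import Counter


def nearest_code(vec, codebook):
    """Index of the codebook centroid closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, codebook[k])),
    )


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm).

    Centroids are initialised from the first k points so the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_code(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def bag_of_words(embeddings, codebook):
    """Normalised histogram of code indices over one clip's embeddings."""
    counts = Counter(nearest_code(e, codebook) for e in embeddings)
    total = len(embeddings)
    return [counts.get(k, 0) / total for k in range(len(codebook))]


# Toy "training" embeddings forming two well-separated groups (made-up values).
train = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
codebook = kmeans(train, k=2)

# Embeddings for one hypothetical clip: three near the first group, one near the second.
clip = [(0.2, 0.1), (0.5, 0.5), (9.8, 10.2), (0.0, 0.3)]
hist = bag_of_words(clip, codebook)
print(hist)  # [0.75, 0.25]
```

The resulting fixed-length histogram is order-independent: it records how often each code occurs in a clip, not where. In the setup the abstract describes, a vector of this kind would be the input to the MLP classification head, which is omitted here.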
