SD AS

Jun 12, 2019

Toward Interpretable Music Tagging with Self-Attention

arXiv:1906.04972v121.285 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses interpretability in music information retrieval (MIR) and is aimed at researchers and practitioners. Its contribution is incremental, since it adapts existing self-attention methods to a new domain.

The authors tackle music tagging with a self-attention-based deep sequence model. It achieves competitive results on the MagnaTagATune and Million Song Dataset benchmarks while offering improved interpretability through heat map visualizations.

Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. The Transformer, which is a sequence model based solely on self-attention, and its variants have achieved state-of-the-art results in many natural language processing tasks. Since music composes its semantics based on the relations between components in sparse positions, adopting the self-attention mechanism to solve music information retrieval (MIR) problems can be beneficial. Hence, we propose a self-attention-based deep sequence model for music tagging. The proposed architecture consists of shallow convolutional layers followed by stacked Transformer encoders. Compared to conventional approaches using fully convolutional or recurrent neural networks, our model is more interpretable while reporting competitive results. We validate the performance of our model with the MagnaTagATune and the Million Song Dataset. In addition, we demonstrate the interpretability of the proposed architecture with a heat map visualization.
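To make the idea of "relating different positions in the sequence" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the authors' implementation. It assumes identity query, key, and value projections and a single head, whereas a Transformer encoder learns these projections and typically uses several heads. The function name `self_attention` and the toy `frames` input are hypothetical. The returned weight matrix is the kind of quantity that could be rendered as a heat map.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with identity projections (an assumption).

    x: list of T feature vectors, each a list of d floats.
    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        # Similarity of this position to every position, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# Toy sequence of three "frames"; the first and third are identical.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

In this toy input, the first frame attends equally to itself and to the identical third frame, and less to the second. Related but distant positions influencing each other in this way is the behavior the abstract motivates.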
