Visualizing and Understanding Self-attention based Music Tagging
This work addresses interpretability for researchers and practitioners in music information retrieval, but it is incremental as it builds on a previously proposed self-attention model.
The paper tackles the problem of interpreting self-attention mechanisms in music tagging models. It focuses on visualizing how these models process music as a temporal sequence rather than as an image, and its results indicate improved interpretability.
Recently, we proposed a self-attention based music tagging model. Unlike most conventional deep architectures in music information retrieval, which treat music spectrograms as images and process them with stacked 3x3 filters, the proposed model attempts to regard music as a temporal sequence of individual audio events. In addition to its tagging performance, this design could also facilitate better interpretability. In this paper, we focus on visualizing and understanding the proposed self-attention based music tagging model.
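
As a rough illustration of what it means to treat music as a temporal sequence, the sketch below applies single-head scaled dot-product self-attention to a sequence of frame-level feature vectors and returns the attention matrix that a visualization could plot. It is a minimal NumPy sketch under stated assumptions, not the model proposed here: the function name `self_attention`, the random projection matrices, and the sizes `T`, `d_model`, and `d_k` are all hypothetical choices made for the example.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x:   (T, d_model) array, one feature vector per time step.
    w_*: (d_model, d_k) projection matrices (hypothetical, random here).
    Returns the attended output (T, d_k) and the attention weights (T, T).
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Similarity between every pair of time steps, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Row-wise softmax (shifted by the row maximum for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Assumed toy dimensions: 8 time steps, 16 input features, 4 attention features.
rng = np.random.default_rng(0)
T, d_model, d_k = 8, 16, 4
x = rng.normal(size=(T, d_model))  # stand-in for per-frame audio features
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
```

Row `t` of `weights` indicates how strongly time step `t` attends to every other time step. Plotting this matrix as a heat map is one common way to inspect which parts of a sequence an attention layer relies on.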