Multi-Modality in Music: Predicting Emotion in Music from High-Level Audio Features and Lyrics
This work addresses music emotion recognition for applications like recommendation systems, but it is incremental as it builds on existing datasets and methods.
The paper tackles music emotion recognition by comparing a multi-modal approach (audio features plus lyrics) with uni-modal approaches. It finds that multi-modal features outperform audio alone for predicting valence, and that 5 of the 11 audio features contribute most to performance.
This paper aims to test whether a multi-modal approach to music emotion recognition (MER) performs better than a uni-modal one on high-level song features and lyrics. We use 11 song features retrieved from the Spotify API, combined with lyrics features including sentiment, TF-IDF, and ANEW, to predict valence and arousal (Russell, 1980) scores on the Deezer Mood Detection Dataset (DMDD) (Delbouys et al., 2018) with 4 different regression models. We find that, of the 11 high-level song features, mainly 5 contribute to the performance, and that multi-modal features do better than audio alone when predicting valence. We have made our code publicly available.
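To make the multi-modal setup concrete, the following is a minimal sketch of how per-song audio features and TF-IDF lyrics features might be joined into one feature vector. It assumes simple concatenation (early fusion), which the abstract does not confirm. The helper names `tfidf_vectors` and `fuse`, the two-value audio vectors, and the toy lyrics are hypothetical and are not taken from DMDD, the Spotify API, or the authors' code.

```python
import math
from collections import Counter


def tfidf_vectors(lyrics):
    """Return (vocabulary, vectors) with one plain TF-IDF vector per lyric."""
    tokenized = [text.lower().split() for text in lyrics]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of lyrics that contain each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty lyrics
        vectors.append([
            (counts[tok] / total) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ])
    return vocabulary, vectors


def fuse(audio_rows, lyric_rows):
    """Concatenate audio and lyric feature vectors song by song (assumed early fusion)."""
    if len(audio_rows) != len(lyric_rows):
        raise ValueError("audio and lyric feature lists must align by song")
    return [list(a) + list(l) for a, l in zip(audio_rows, lyric_rows)]


# Toy data with made-up values, for illustration only.
audio = [[0.8, 0.6], [0.2, 0.3], [0.5, 0.9]]  # two hypothetical audio features per song
lyrics = ["happy sunny day", "sad rainy day", "happy happy dance"]

vocab, lyric_vecs = tfidf_vectors(lyrics)
multimodal = fuse(audio, lyric_vecs)  # one row per song: audio values, then TF-IDF weights
```

A regression model trained on `audio` alone and another trained on `multimodal` could then be compared on held-out valence or arousal scores, which is the kind of uni-modal versus multi-modal comparison the abstract describes.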