IR, AI | Feb 10

Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

Yizhi Zhou, Jia-Qi Yang, De-Chuan Zhan, Da-Wei Zhou

arXiv:2604.20847 | h-index: 4 | Has Code

AI Analysis

For researchers in music recommendation, this work provides a standardized multimodal benchmark and a more efficient feature aggregation method, though it is incremental in nature.

The authors introduce TASTE, a multimodal dataset and benchmarking framework for music recommendation, and propose MuQ-token, a method for efficiently aggregating multi-layer audio features. MuQ-token outperforms other feature integration techniques on candidate recall and click-through rate (CTR) tasks, supporting the value of content-driven approaches.

Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. Progress on content-based alternatives is hindered because existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and click-through rate (CTR) prediction. In addition, we introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at https://github.com/zreach/TASTE
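
To make "integration of multi-layer audio features" concrete, here is a minimal sketch of one generic baseline: a softmax-weighted sum over per-layer embeddings. This is not the MuQ-token method, which the abstract does not specify. The function name `aggregate_layers`, the array shapes, and the use of NumPy are all assumptions made for illustration; in practice the layer logits would be learned jointly with the recommender.

```python
import numpy as np


def aggregate_layers(layer_feats, layer_logits):
    """Combine per-layer track embeddings into a single vector.

    layer_feats:  array of shape (num_layers, dim), one embedding per encoder layer
                  (hypothetical layout, assumed for this sketch).
    layer_logits: array of shape (num_layers,), unnormalized layer weights.
    Returns an array of shape (dim,).
    """
    layer_feats = np.asarray(layer_feats, dtype=np.float64)
    layer_logits = np.asarray(layer_logits, dtype=np.float64)

    # Numerically stable softmax so the layer weights are positive and sum to 1.
    weights = np.exp(layer_logits - layer_logits.max())
    weights /= weights.sum()

    # Weighted sum over the layer axis.
    return weights @ layer_feats
```

With all logits equal, the result reduces to a plain average of the layer embeddings, which is a useful sanity check before training the weights.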

