Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation
This work addresses the need for updated evaluation methods in music recommendation, for researchers and practitioners alike. Its contribution is incremental: it synthesizes existing knowledge rather than introducing new techniques.
This paper tackles the challenge of evaluating music recommender systems as they shift from traditional information-retrieval methods to large language models, arguing that standard accuracy metrics are inadequate and proposing a structured framework for assessment. It reviews how LLMs reshape user and item modeling, synthesizes insights from NLP evaluation practices, and outlines success and risk dimensions to guide the community.
Music Recommender Systems (MRS) have long relied on an information-retrieval framing, where progress is measured mainly through accuracy on retrieval-oriented subtasks. While effective, this reductionist paradigm struggles to address the deeper question of what makes a good recommendation, and attempts to broaden evaluation, through user studies or fairness analyses, have had limited impact. The emergence of Large Language Models (LLMs) disrupts this framework: LLMs are generative rather than ranking-based, making standard accuracy metrics questionable. They also introduce challenges such as hallucinations, knowledge cutoffs, non-determinism, and opaque training data, rendering traditional train/test protocols difficult to interpret. At the same time, LLMs create new opportunities, enabling natural-language interaction and even allowing models to act as evaluators. This work argues that the shift toward LLM-driven MRS requires rethinking evaluation. We first review how LLMs reshape user modeling, item modeling, and natural-language recommendation in music. We then examine evaluation practices from NLP, highlighting methodologies and open challenges relevant to MRS. Finally, we synthesize insights, focusing on how LLM prompting applies to MRS, to outline a structured set of success and risk dimensions. Our goal is to provide the MRS community with an updated, pedagogical, and cross-disciplinary perspective on evaluation.