CL · Jan 23, 2025

Can Large Language Models Understand Preferences in Personalized Recommendation?

Zhaoxuan Tan, Zinan Zeng, Qingkai Zeng, Zhenyu Wu, Zheyuan Liu, Fengran Mo, Meng Jiang

arXiv:2501.13391v113.011 citations · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses an evaluation gap in personalized recommendation that matters to AI researchers, though the advance is incremental because it builds on existing ranking methods.

The authors address the problem of evaluating LLMs in personalized recommendation by introducing PerRecBench, which factors out user rating bias and item quality to assess how well a model captures personal preferences. They find that LLMs, including larger models, struggle with this task, and that pairwise and listwise ranking outperform pointwise approaches.

Large Language Models (LLMs) excel in various tasks, including personalized recommendations. Existing evaluation methods often focus on rating prediction, relying on regression errors between actual and predicted ratings. However, user rating bias and item quality, two influential factors behind rating scores, can obscure personal preferences in user-item pair data. To address this, we introduce PerRecBench, disassociating the evaluation from these two factors and assessing recommendation techniques on capturing the personal preferences in a grouped ranking manner. We find that the LLM-based recommendation techniques that are generally good at rating prediction fail to identify users' favored and disfavored items when the user rating bias and item quality are eliminated by grouping users. With PerRecBench and 19 LLMs, we find that while larger models generally outperform smaller ones, they still struggle with personalized recommendation. Our findings reveal the superiority of pairwise and listwise ranking approaches over pointwise ranking, PerRecBench's low correlation with traditional regression metrics, the importance of user profiles, and the role of pretraining data distributions. We further explore three supervised fine-tuning strategies, finding that merging weights from single-format training is promising but improving LLMs' understanding of user preferences remains an open research problem. Code and data are available at https://github.com/TamSiuhin/PerRecBench
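The idea of removing user rating bias and item quality before ranking can be illustrated with a small sketch. This is not the PerRecBench implementation; it is one plausible reading of the abstract, under two assumptions: that item quality is held fixed by grouping users who rated the same item, and that user rating bias is removed by subtracting each user's own average rating. The function names, the toy data, and the use of Kendall's tau as the agreement measure are all hypothetical choices made for illustration.

```python
from itertools import combinations
from statistics import mean


def relative_rating(rating, user_history):
    """Rating minus the user's own average rating (assumed way to remove rating bias)."""
    return rating - mean(user_history)


def kendall_tau(true_scores, pred_scores):
    """Kendall tau-a between two score lists; tied pairs contribute zero."""
    concordant = discordant = 0
    for i, j in combinations(range(len(true_scores)), 2):
        sign = (true_scores[i] - true_scores[j]) * (pred_scores[i] - pred_scores[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(true_scores) * (len(true_scores) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical group: three users who all rated the same item,
# so item quality is constant within the group.
group = {
    "u1": {"rating": 4, "history": [5, 5, 5, 5]},  # generous rater: 4 is below their norm
    "u2": {"rating": 4, "history": [2, 3, 3, 4]},  # strict rater: 4 is above their norm
    "u3": {"rating": 3, "history": [3, 3, 3, 3]},  # exactly at their norm
}

users = sorted(group)
truth = [relative_rating(group[u]["rating"], group[u]["history"]) for u in users]
raw = [group[u]["rating"] for u in users]

# Agreement between the de-biased preference order and the raw rating order.
tau_raw = kendall_tau(truth, raw)
# Agreement for a hypothetical model that ranks the users correctly.
tau_good = kendall_tau(truth, [1, 3, 2])
```

In this toy group, ordering users by raw rating shows no net agreement with the de-biased preference order, which mirrors the abstract's point that accuracy on rating prediction need not translate into identifying which users actually favored an item.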
