IR · AI · Dec 15, 2024

RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models

arXiv:2412.11068v1 · 1 citation · h-index: 7 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more nuanced offline evaluation in recommender systems by offering a way to simulate user preferences without online experiments. It is incremental, however, because it builds on existing LLM-based evaluation ideas.

The paper tackles offline evaluation for recommender systems by introducing RecSys Arena, which uses large language models (LLMs) as judges to simulate user feedback on recommendation results. It reports that LLM judgments are consistent with offline metrics and better distinguish between competitive algorithms.

Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation relies on offline metrics to allow quick algorithm iteration, since online experiments are usually risky and time-consuming. However, offline evaluation usually cannot fully reflect users' preferences for the outcomes of different recommendation algorithms, and its results may not be consistent with online A/B tests. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems in different aspects, and these differences may lead to substantial performance gaps in long-term online serving. Fortunately, thanks to the strong commonsense knowledge and role-play capability of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by LLM Chatbot Arena, in this paper we present RecSys Arena, in which the recommendation results given by two different recommender systems in each session are evaluated by an LLM judge to obtain fine-grained evaluation feedback. More specifically, for each sample we use an LLM to generate a user profile description based on the user's behavior history or off-the-shelf profile features. This description guides the LLM to play the role of that user and evaluate its relative preference between two recommendation results generated by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insight into many subjective aspects. Moreover, RecSys Arena can better distinguish different algorithms with comparable performance in terms of AUC and nDCG.
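The pipeline described in the abstract (generate a user profile, then have the LLM role-play that user and pick between two recommendation lists) can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names (`build_profile_prompt`, `build_judge_prompt`, `parse_verdict`, `arena`), and the tie handling are all assumptions made for illustration, and `stub_llm` is a toy stand-in so that the example runs without any real model.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical interface: any function mapping a prompt string to a text reply.
LLM = Callable[[str], str]
Session = Tuple[Sequence[str], Sequence[str], Sequence[str]]  # (history, list A, list B)


def build_profile_prompt(history: Sequence[str]) -> str:
    """Ask the model to summarise a user's interests from their history."""
    items = "\n".join(f"- {title}" for title in history)
    return f"Describe the interests of a user who interacted with:\n{items}"


def build_judge_prompt(profile: str, list_a: Sequence[str], list_b: Sequence[str]) -> str:
    """Ask the model to role-play the user and compare two recommendation lists."""
    a = "\n".join(f"- {item}" for item in list_a)
    b = "\n".join(f"- {item}" for item in list_b)
    return (
        f"You are a user with this profile:\n{profile}\n\n"
        f"List A:\n{a}\n\nList B:\n{b}\n\n"
        "Which list would you prefer? Answer with 'A' or 'B'."
    )


def parse_verdict(reply: str) -> str:
    """Map a reply to 'A', 'B', or 'tie' (assumption: unparseable counts as a tie)."""
    token = reply.strip().strip(".'\"").upper()
    return token if token in ("A", "B") else "tie"


def arena(sessions: Sequence[Session], llm: LLM) -> Counter:
    """Tally pairwise verdicts for system A versus system B over all sessions."""
    tally: Counter = Counter()
    for history, list_a, list_b in sessions:
        profile = llm(build_profile_prompt(history))
        reply = llm(build_judge_prompt(profile, list_a, list_b))
        tally[parse_verdict(reply)] += 1
    return tally


def stub_llm(prompt: str) -> str:
    """Toy model: prefers whichever list mentions 'sci-fi' more often."""
    if prompt.startswith("Describe"):
        return "Enjoys science fiction."
    head, b_text = prompt.split("List B:")
    a_text = head.split("List A:")[1]
    a_hits, b_hits = a_text.count("sci-fi"), b_text.count("sci-fi")
    if a_hits == b_hits:
        return "tie"
    return "A" if a_hits > b_hits else "B"


sessions = [
    (["Dune", "Blade Runner"], ["sci-fi: Arrival", "sci-fi: Solaris"], ["romance: Notting Hill", "sci-fi: Moon"]),
    (["Alien"], ["drama: Heat"], ["sci-fi: Contact"]),
    (["Her"], ["drama: Heat"], ["comedy: Airplane"]),
]
result = arena(sessions, stub_llm)
print(dict(result))
```

In practice `stub_llm` would be replaced by a call to an actual model, and the per-session verdicts could be aggregated into a win rate for each recommender. The abstract also mentions fine-grained feedback on subjective aspects, which this sketch omits.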

Code Implementations: 1 repo
