IR · AI · Jan 27

HELM: A Human-Centered Evaluation Framework for LLM-Powered Recommender Systems

arXiv:2601.19197v1 · 2.3 · h-index: 63 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation of LLM-based recommender systems, with the aim of improving real-world user experience. It is an incremental advance because it builds on existing evaluation practices.

The paper argues that existing evaluation methods for LLM-powered recommender systems overemphasize accuracy and miss human-centered qualities. It introduces HELM, a framework that assesses five dimensions: Intent Alignment, Explanation Quality, Interaction Naturalness, Trust & Transparency, and Fairness & Diversity. Results show that GPT-4 leads on explanation quality (4.21/5.0) and interaction naturalness (4.35/5.0) but exhibits higher popularity bias (Gini coefficient 0.73 vs. 0.58 for traditional collaborative filtering).

The integration of Large Language Models (LLMs) into recommendation systems has introduced unprecedented capabilities for natural language understanding, explanation generation, and conversational interactions. However, existing evaluation methodologies focus predominantly on traditional accuracy metrics, failing to capture the multifaceted human-centered qualities that determine the real-world user experience. We introduce HELM (Human-centered Evaluation for LLM-powered recoMmenders), a comprehensive evaluation framework that systematically assesses LLM-powered recommender systems across five human-centered dimensions: Intent Alignment, Explanation Quality, Interaction Naturalness, Trust & Transparency, and Fairness & Diversity. Through extensive experiments involving three state-of-the-art LLM-based recommenders (GPT-4, LLaMA-3.1, and P5) across three domains (movies, books, and restaurants), and rigorous evaluation by 12 domain experts using 847 recommendation scenarios, we demonstrate that HELM reveals critical quality dimensions invisible to traditional metrics. Our results show that while GPT-4 achieves superior explanation quality (4.21/5.0) and interaction naturalness (4.35/5.0), it exhibits a significant popularity bias (Gini coefficient 0.73) compared to traditional collaborative filtering (0.58). We release HELM as an open-source toolkit to advance human-centered evaluation practices in the recommender systems community.
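The abstract reports popularity bias as a Gini coefficient but does not say how it is computed. As a rough illustration only, the sketch below assumes the standard Gini definition applied to how often each catalog item is recommended: 0 means every item is recommended equally often, and values near 1 mean a few items dominate. The function names `gini_coefficient` and `recommendation_gini` are hypothetical and are not taken from the HELM toolkit.

```python
from collections import Counter
from typing import Iterable, Sequence


def gini_coefficient(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts (0 = even, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted form over ascending values x_1..x_n:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


def recommendation_gini(rec_lists: Iterable[Sequence[str]],
                        catalog: Sequence[str]) -> float:
    """Gini over per-item recommendation counts (assumed setup, not HELM's API).

    Items in the catalog that are never recommended count as zero exposure.
    """
    counts = Counter(item for recs in rec_lists for item in recs)
    return gini_coefficient([counts.get(item, 0) for item in catalog])
```

Under this assumed definition, a recommender that spreads exposure evenly over the catalog scores close to 0, so a move from 0.58 to 0.73 would indicate recommendations concentrating on fewer items.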
