Toward a New Protocol to Evaluate Recommender Systems
This work addresses the need for more nuanced evaluation methods for recommender systems in industrial applications, although its contribution is incremental because it builds on existing protocols and measures.
The paper tackles the problem of evaluating recommender systems by proposing a global offline protocol based on four structuring functions, introducing a new measure called Average Measure of Impact, and showing that performance varies by user and item segments with no clear correlation to RMSE.
In this paper, we propose an approach to analyze the performance and the added value of automatic recommender systems in an industrial context. We show that recommender systems are multifaceted and can be organized around four structuring functions: helping users to decide, to compare, to discover, and to explore. A global offline protocol is then proposed to evaluate recommender systems. This protocol is based on the definition of appropriate evaluation measures for each of these functions. The evaluation protocol is discussed from the perspective of the usefulness of, and trust in, the recommendations. A new measure, called Average Measure of Impact, is introduced to evaluate the impact of personalized recommendation. We experiment with two classical methods, K-Nearest Neighbors (KNN) and Matrix Factorization (MF), on the well-known Netflix dataset. A segmentation of both users and items is proposed to analyze in detail where the algorithms perform well or badly. We show that performance depends strongly on the segments and that there is no clear correlation between the RMSE and the quality of the recommendation.
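
The segment-level analysis described above can be illustrated with a minimal sketch that computes RMSE separately for each (user segment, item segment) pair. The abstract does not state the segmentation criteria, so the two-way split by number of training ratings, the `threshold` parameter, the segment labels "heavy" and "light", and all function names below are assumptions made for illustration only, not the paper's actual protocol.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def segment_of(count, threshold):
    """Hypothetical two-way segmentation by number of training ratings."""
    return "heavy" if count >= threshold else "light"


def rmse_by_segment(records, user_counts, item_counts, threshold):
    """Group test records by (user segment, item segment) and score each group.

    records: iterable of (user, item, true_rating, predicted_rating)
    user_counts / item_counts: number of training ratings per user / item
    """
    groups = defaultdict(list)
    for user, item, true_rating, predicted in records:
        key = (
            segment_of(user_counts.get(user, 0), threshold),
            segment_of(item_counts.get(item, 0), threshold),
        )
        groups[key].append((true_rating, predicted))
    return {key: rmse(pairs) for key, pairs in groups.items()}


# Toy data (invented for illustration, not taken from Netflix).
records = [
    ("u1", "i1", 4.0, 3.0),
    ("u1", "i2", 2.0, 2.0),
    ("u2", "i1", 5.0, 3.0),
    ("u2", "i2", 1.0, 1.0),
]
user_counts = {"u1": 10, "u2": 1}
item_counts = {"i1": 50, "i2": 2}

per_segment = rmse_by_segment(records, user_counts, item_counts, threshold=5)
overall = rmse([(t, p) for _, _, t, p in records])
```

On this toy data the single global RMSE hides the fact that the error is concentrated in some segments and absent in others, which is the kind of effect a segmented evaluation is meant to expose.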