It's Time to Consider "Time" when Evaluating Recommender-System Algorithms [Proposal]
This proposal addresses inadequate evaluation practices in recommender-system research and could lead to better algorithm assessments. Its contribution is incremental: it suggests a methodological refinement rather than a new algorithm.
The authors argue that current evaluation metrics for recommender systems, reported as single numbers such as precision or MAE, average over long periods and therefore give only a static and vague view. They propose instead calculating metrics per time interval (e.g., per week or month) and plotting them, so that the results show how effectiveness develops over time and support more meaningful predictions of future performance.
In this position paper, we question the current practice of reporting evaluation metrics for recommender systems as single numbers (e.g., precision p = .28 or mean absolute error MAE = 1.21). We argue that a single number expresses only the average effectiveness over a usually rather long period (e.g., a year or even longer), which provides only a vague and static view of the data. We propose that recommender-system researchers instead calculate metrics for each interval of a time series, such as each week or month, and plot the results, for example in a line chart. This way, the results show how an algorithm's effectiveness develops over time and hence allow more meaningful conclusions about how the algorithm will perform in the future. In this paper, we explain our reasoning, provide an example to illustrate it, and present suggestions for what the community should do next.
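
The contrast between a single aggregate number and a per-interval series can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mae_by_month` and `overall_mae`, the record layout (date, predicted rating, actual rating), and the toy data are all assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date


def overall_mae(records):
    """Single-number MAE over the whole period (the practice being questioned)."""
    errors = [abs(predicted - actual) for _, predicted, actual in records]
    return sum(errors) / len(errors)


def mae_by_month(records):
    """MAE per calendar month, returned as {(year, month): mae} in chronological order."""
    errors = defaultdict(list)
    for day, predicted, actual in records:
        errors[(day.year, day.month)].append(abs(predicted - actual))
    return {period: sum(errs) / len(errs) for period, errs in sorted(errors.items())}


# Hypothetical toy data: (date of rating, predicted rating, actual rating).
records = [
    (date(2020, 1, 5), 4.0, 3.0),
    (date(2020, 1, 20), 2.0, 4.0),
    (date(2020, 2, 3), 3.5, 3.0),
    (date(2020, 2, 17), 5.0, 4.5),
    (date(2020, 3, 9), 4.0, 4.0),
]

single = overall_mae(records)
monthly = mae_by_month(records)

print(f"overall MAE: {single:.2f}")
for (year, month), mae in monthly.items():
    print(f"{year}-{month:02d}: MAE = {mae:.2f}")
```

In this toy data the overall MAE hides the fact that the error shrinks from month to month. The per-month values are what one would plot in a line chart, with the interval on the x-axis and the metric on the y-axis.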