On Sampling Top-K Recommendation Evaluation
This work addresses concerns among recommendation-system researchers and practitioners about the validity of widely used sampling-based evaluation metrics. Its contribution is incremental: it confirms existing practices rather than introducing new methods.
The paper tackles the problem of whether sampling-based top-k metrics are reliable for evaluating recommendation algorithms, demonstrating both theoretically and experimentally that sampling top-k Hit-Ratio accurately approximates global Hit-Ratio and consistently predicts correct winners.
Recently, Rendle has warned that the use of sampling-based top-$k$ metrics might not suffice. This throws a number of recent studies on deep learning-based recommendation algorithms, and classic non-deep-learning algorithms using such a metric, into jeopardy. In this work, we thoroughly investigate the relationship between the sampling and global top-$K$ Hit-Ratio (HR, or Recall), originally proposed by Koren [2] and extensively used by others. By formulating the problem of aligning sampling top-$k$ ($SHR@k$) and global top-$K$ ($HR@K$) Hit-Ratios through a mapping function $f$, so that $SHR@k\approx HR@f(k)$, we demonstrate both theoretically and experimentally that the sampling top-$k$ Hit-Ratio provides an accurate approximation of its global (exact) counterpart, and can consistently predict the correct winners (the same as indicated by their corresponding global Hit-Ratios).
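To make the relationship $SHR@k\approx HR@f(k)$ concrete, here is a minimal simulation sketch. It rests on assumptions that do not come from the paper: the global ranks are synthetic (drawn from a clipped geometric distribution), negatives are sampled uniformly without replacement, and the function `empirical_mapping` is a hypothetical stand-in that simply searches for the $K$ whose $HR@K$ is closest to $SHR@k$. It is not the mapping function $f$ derived in the paper.

```python
import numpy as np

def global_hit_ratio(global_ranks, K):
    """HR@K: fraction of users whose held-out item ranks in the top K of all items."""
    return float(np.mean(global_ranks <= K))

def sampled_ranks(global_ranks, num_items, num_sampled, rng):
    """Rank of the held-out item among itself plus (num_sampled - 1) other items
    drawn uniformly without replacement (an assumed sampling scheme)."""
    n_above = global_ranks - 1           # items ranked above the target globally
    n_below = num_items - global_ranks   # items ranked below the target globally
    # Number of sampled items that outrank the target is hypergeometric.
    above_in_sample = rng.hypergeometric(n_above, n_below, num_sampled - 1)
    return above_in_sample + 1

def sampled_hit_ratio(s_ranks, k):
    """SHR@k: fraction of users whose held-out item ranks in the top k of the sample."""
    return float(np.mean(s_ranks <= k))

def empirical_mapping(global_ranks, s_ranks, k, num_items):
    """Hypothetical helper: the K in 1..num_items whose HR@K is closest to SHR@k."""
    shr = sampled_hit_ratio(s_ranks, k)
    counts = np.bincount(global_ranks, minlength=num_items + 1)[1:]
    hr_curve = np.cumsum(counts) / len(global_ranks)  # hr_curve[i] == HR@(i + 1)
    return int(np.argmin(np.abs(hr_curve - shr))) + 1

# Synthetic setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
num_items, num_sampled, num_users = 2000, 100, 5000
ranks = np.minimum(rng.geometric(0.01, size=num_users), num_items)
s_ranks = sampled_ranks(ranks, num_items, num_sampled, rng)

for k in (1, 5, 10):
    K = empirical_mapping(ranks, s_ranks, k, num_items)
    print(f"k={k:2d}  SHR@k={sampled_hit_ratio(s_ranks, k):.3f}  "
          f"K={K:4d}  HR@K={global_hit_ratio(ranks, K):.3f}")
```

Because a sampled rank can never exceed the corresponding global rank, $SHR@k \ge HR@k$ always holds in this setup, which is why the matching $K$ is generally much larger than $k$. Comparing two recommenders would amount to running the same procedure on each model's ranks and checking whether the ordering by $SHR@k$ agrees with the ordering by $HR@K$.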