IR LG Dec 4, 2019

Evaluation Metrics for Item Recommendation under Sampling

arXiv:1912.02263v1 12.620 citations

Originality: Incremental advance

AI Analysis

This work highlights a critical flaw in widely used evaluation practices for recommendation systems, potentially affecting researchers and practitioners relying on sampled metrics for efficiency.

The paper investigates sampled metrics for evaluating item recommendation algorithms and finds that they are inconsistent with exact metrics: they fail to preserve relative performance comparisons even in expectation, and for very small sample sizes all metrics collapse to AUC.

The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that sampled metrics are inconsistent with their exact version. Sampled metrics do not preserve relative statements, e.g., 'algorithm A is better than B', not even in expectation. Moreover, the smaller the sampling size, the smaller the difference between metrics, and for very small sampling sizes, all metrics collapse to the AUC metric.
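The inconsistency described in the abstract can be illustrated with a small Python sketch. It is not the paper's code. It assumes one relevant item per user, negatives sampled uniformly with replacement, and made-up rank lists for two hypothetical algorithms A and B; all function names are illustrative. Under these assumptions the expected sampled metric can be computed exactly from a binomial distribution, so no random simulation is needed.

```python
import math

def auc(rank, n):
    """AUC for a single relevant item at 1-indexed exact rank among n items."""
    return (n - rank) / (n - 1)

def recall_at(k):
    """Return a metric function: 1.0 if the relevant item is in the top k."""
    return lambda rank: 1.0 if rank <= k else 0.0

def expected_sampled_metric(metric, rank, n, m):
    """Expected value of `metric` on the sampled rank.

    Assumption: m negatives are drawn uniformly with replacement from the
    n - 1 non-relevant items, so the number of sampled negatives ranked
    above the relevant item is Binomial(m, (rank - 1) / (n - 1)).
    """
    p_above = (rank - 1) / (n - 1)
    total = 0.0
    for j in range(m + 1):
        prob = math.comb(m, j) * p_above**j * (1 - p_above) ** (m - j)
        total += prob * metric(1 + j)  # sampled rank is 1 + j
    return total

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical exact ranks of the relevant item for two users (illustrative only).
n = 10001
ranks_a = [1, 5000]   # A: one perfect hit, one very poor ranking
ranks_b = [40, 40]    # B: consistently mediocre rankings

# Exact Recall@10: A beats B.
exact_a = mean(recall_at(10)(r) for r in ranks_a)
exact_b = mean(recall_at(10)(r) for r in ranks_b)

# Expected sampled Recall@10 with m = 100 negatives: the order flips.
sampled_a = mean(expected_sampled_metric(recall_at(10), r, n, 100) for r in ranks_a)
sampled_b = mean(expected_sampled_metric(recall_at(10), r, n, 100) for r in ranks_b)

# With a single sampled negative, expected Recall@1 coincides with AUC.
collapsed_a = mean(expected_sampled_metric(recall_at(1), r, n, 1) for r in ranks_a)
auc_a = mean(auc(r, n) for r in ranks_a)

print(exact_a > exact_b, sampled_a < sampled_b, math.isclose(collapsed_a, auc_a))
```

In this toy setup A is better than B under exact Recall@10, yet B is better than A under the expected sampled Recall@10, and with one sampled negative the sampled Recall@1 equals AUC. The specific ranks and sample sizes were chosen only to make the effect visible.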
