IR, LG | Dec 4, 2019

Evaluation Metrics for Item Recommendation under Sampling

arXiv:1912.02263v1 | 20 citations
Originality: Incremental advance
AI Analysis

This work highlights a critical flaw in widely used evaluation practices for recommendation systems, potentially affecting researchers and practitioners relying on sampled metrics for efficiency.

The paper investigates sampled metrics for evaluating item recommendation algorithms and finds that they are inconsistent with exact metrics, failing to preserve relative performance comparisons even in expectation, with all metrics collapsing to AUC for very small sample sizes.

The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics, where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that they are inconsistent with their exact versions. Sampled metrics do not preserve relative statements, e.g., 'algorithm A is better than B', not even in expectation. Moreover, the smaller the sampling size, the smaller the difference between metrics, and for very small sampling sizes, all metrics collapse to the AUC metric.
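The two claims in the abstract can be illustrated with a small numerical sketch. It is not the paper's code. It assumes one relevant item per evaluation instance and negatives drawn uniformly with replacement, so that the number of sampled negatives ranked above the relevant item is binomially distributed. The catalogue size, the rank lists for the hypothetical algorithms A and B, and all function names are made up for illustration.

```python
from math import comb

def auc(rank, n):
    """Exact AUC for one relevant item at position `rank` among n items."""
    return (n - rank) / (n - 1)

def recall_at(k):
    """Return a Recall@k function of the rank (single relevant item)."""
    return lambda rank: 1.0 if rank <= k else 0.0

def expected_sampled(metric, rank, n, m):
    """Expected metric when the relevant item is ranked against m-1
    negatives drawn uniformly with replacement (an assumption here)."""
    p = (rank - 1) / (n - 1)  # chance a sampled negative outranks the item
    total = 0.0
    for j in range(m):  # j = number of sampled negatives ranked above
        prob = comb(m - 1, j) * p**j * (1 - p) ** (m - 1 - j)
        total += prob * metric(j + 1)
    return total

def mean(values):
    return sum(values) / len(values)

n = 10_000              # hypothetical catalogue size
ranks_a = [1, 5000]     # hypothetical algorithm A: one hit at the top, one miss
ranks_b = [10, 10]      # hypothetical algorithm B: always close, never top 5
recall5 = recall_at(5)

exact_a = mean([recall5(r) for r in ranks_a])
exact_b = mean([recall5(r) for r in ranks_b])
sampled_a = mean([expected_sampled(recall5, r, n, 100) for r in ranks_a])
sampled_b = mean([expected_sampled(recall5, r, n, 100) for r in ranks_b])

print("exact   Recall@5:", exact_a, exact_b)
print("sampled Recall@5:", sampled_a, sampled_b)

# With a sample of size m = 2 (one negative), sampled Recall@1 is just
# the probability that the single negative ranks below the item: the AUC.
for r in (1, 10, 5000):
    print(r, expected_sampled(recall_at(1), r, n, 2), auc(r, n))
```

Under these assumptions, exact Recall@5 prefers A while the expected sampled Recall@5 with 100 items prefers B, and with a sample of size 2 the expected sampled Recall@1 equals the AUC of the same rank.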

