LG · Jun 14, 2025

Are We Really Measuring Progress? Transferring Insights from Evaluating Recommender Systems to Temporal Link Prediction

Filip Cornell, Oleg Smirnov, Gabriela Zarzar Gandler, Lele Cao

arXiv:2506.12588v17.11 citations · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses evaluation reliability, a concern for researchers in graph learning and TLP, but it is incremental: it builds on existing critiques without presenting new results.

The paper tackles issues in evaluating Temporal Link Prediction (TLP) by identifying problems like inconsistent metrics and hard negative sampling, drawing insights from recommender systems to argue for more robust evaluation protocols.

Recent work has questioned the reliability of graph learning benchmarks, citing concerns around task design, methodological rigor, and data suitability. In this extended abstract, we contribute to this discussion by focusing on evaluation strategies in Temporal Link Prediction (TLP). We observe that current evaluation protocols are often affected by one or more of the following issues: (1) inconsistent sampled metrics, (2) reliance on hard negative sampling often introduced as a means to improve robustness, and (3) metrics that implicitly assume equal base probabilities across source nodes by combining predictions. We support these claims through illustrative examples and connections to longstanding concerns in the recommender systems community. Our ongoing work aims to systematically characterize these problems and explore alternatives that can lead to more robust and interpretable evaluation. We conclude with a discussion of potential directions for improving the reliability of TLP benchmarks.
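To make issue (1) concrete, here is a minimal Python sketch of how a sampled ranking metric can diverge from its full-ranking counterpart. It is not taken from the paper: the scores, the sample size of 20, and the helper names `rank_of_positive` and `reciprocal_rank` are all hypothetical, chosen only for illustration.

```python
import random


def rank_of_positive(pos_score, neg_scores):
    """1-based rank of the true destination among the given negatives."""
    return 1 + sum(1 for s in neg_scores if s > pos_score)


def reciprocal_rank(rank):
    """Per-query contribution to MRR."""
    return 1.0 / rank


# Hypothetical toy scores (an assumption, not data from the paper):
# 1000 negative destinations scored 0..999, one true destination scored 900.5.
neg_scores = list(range(1000))
pos_score = 900.5

# Full evaluation: rank the positive against every negative.
full_rank = rank_of_positive(pos_score, neg_scores)

# Sampled evaluation: rank the positive against 20 uniformly sampled negatives.
rng = random.Random(0)
sampled_negs = rng.sample(neg_scores, 20)
sampled_rank = rank_of_positive(pos_score, sampled_negs)

print("full rank:", full_rank, "RR:", reciprocal_rank(full_rank))
print("sampled rank:", sampled_rank, "RR:", reciprocal_rank(sampled_rank))
```

Because the sampled negatives are a subset of all negatives, the sampled rank can never be worse than the full rank, so the sampled reciprocal rank is an optimistic estimate. Its value also depends on how many negatives are drawn, which is one reason sampled metrics from different protocols are hard to compare.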
