LG | May 26

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

Mingqiao Zhang, Qiyao Peng, Yinghui Wang, Hongtao Liu, Yumeng Wang

arXiv:2602.13626 | 62.31 citations | h-index: 12 | Has Code

AI Analysis

For researchers evaluating LLM-based recommenders, this paper reveals a critical, overlooked confound that undermines the reliability of reported results.

The paper identifies benchmark data leakage in LLM-based recommendation, where LLMs memorize benchmark data, causing inflated performance. Experiments show domain-relevant leakage yields spurious gains, while irrelevant leakage degrades accuracy.

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings show that data leakage is a critical, previously unaccounted-for factor in LLM-based recommendation that can distort measured model performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.
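The blended-corpus setup described in the abstract can be pictured with a minimal sketch. This is not the authors' released code: the serialization template, the `leak_ratio` knob, and the function names `interaction_to_text` and `blend_corpus` are all assumptions made for illustration. It shows one plausible way to mix serialized user-item interactions into a general text corpus before continued pre-training.

```python
import random


def interaction_to_text(user_id, item_titles):
    """Serialize one user's history as a text line (hypothetical template)."""
    return f"User {user_id} interacted with: " + ", ".join(item_titles)


def blend_corpus(general_docs, leaked_interactions, leak_ratio, seed=0):
    """Mix serialized benchmark interactions into a general corpus.

    leak_ratio is the target fraction of leaked documents in the final
    corpus. It is an assumed knob, not a parameter taken from the paper.
    """
    if not 0.0 <= leak_ratio < 1.0:
        raise ValueError("leak_ratio must be in [0, 1)")
    rng = random.Random(seed)
    leaked_docs = [interaction_to_text(u, items) for u, items in leaked_interactions]
    # Leaked docs needed so that leaked / (leaked + general) is about leak_ratio.
    n_leak = round(leak_ratio * len(general_docs) / (1.0 - leak_ratio))
    # Cap at what is available if the benchmark is smaller than requested.
    n_leak = min(n_leak, len(leaked_docs))
    corpus = list(general_docs) + rng.sample(leaked_docs, n_leak)
    rng.shuffle(corpus)
    return corpus


if __name__ == "__main__":
    general = [f"General document {i}" for i in range(90)]
    leaked = [(u, ["Item A", "Item B"]) for u in range(50)]
    corpus = blend_corpus(general, leaked, leak_ratio=0.1)
    print(len(corpus), sum(doc.startswith("User ") for doc in corpus))
```

Under these assumptions, an in-domain scenario would pass interactions from the evaluation benchmark as `leaked_interactions`, and an out-of-domain scenario would pass interactions from an unrelated dataset. Comparing the two against a `leak_ratio=0.0` baseline mirrors the contrast the abstract draws.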
