IR · LG · Jul 22, 2025

Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders

arXiv:2507.16289v1 · 14 citations · h-index: 5 · Has Code · RecSys
Originality: Incremental advance
AI Analysis

This work addresses reproducibility issues in sequential recommendation evaluation for researchers and practitioners, though it is incremental as it builds on existing splitting methods.

The paper tackles the problem of insufficient evaluation protocols for sequential recommender systems by systematically comparing data splitting strategies, finding that common splits like leave-one-out can misalign with realistic scenarios and affect model rankings.

Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits overlap between training and test periods, which leads to temporal leakage and an unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its application to sequential recommendations remains loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides the necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: https://github.com/monkey0head/time-to-split
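
To make the contrast between the two splits concrete, here is a minimal sketch in plain Python. It is not taken from the paper's repository: the toy interaction log, the function names `leave_one_out` and `global_temporal`, and the cutoff value are all assumptions chosen for illustration. It also leaves aside the questions the abstract flags as loosely defined for global temporal splits (which test interactions become targets, and how the validation subset is built).

```python
from collections import defaultdict

# Toy interaction log of (user, item, timestamp); all values are made up.
interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 9),
    ("u2", "d", 3), ("u2", "e", 4),
    ("u3", "f", 7), ("u3", "g", 8),
]

def leave_one_out(log):
    """Hold out each user's last interaction as that user's test target."""
    by_user = defaultdict(list)
    for record in sorted(log, key=lambda r: r[2]):
        by_user[record[0]].append(record)
    train, test = [], []
    for seq in by_user.values():
        train.extend(seq[:-1])  # everything but the last interaction
        test.append(seq[-1])    # the last interaction is the target
    return train, test

def global_temporal(log, cutoff):
    """Train on interactions before `cutoff`; test on those at or after it."""
    train = [r for r in log if r[2] < cutoff]
    test = [r for r in log if r[2] >= cutoff]
    return train, test

loo_train, loo_test = leave_one_out(interactions)
gts_train, gts_test = global_temporal(interactions, cutoff=6)

# A split overlaps in time if some test target is older than some
# training interaction, i.e. the model may see "future" data in training.
loo_overlap = min(r[2] for r in loo_test) < max(r[2] for r in loo_train)
gts_overlap = min(r[2] for r in gts_test) < max(r[2] for r in gts_train)

print("leave-one-out overlaps in time:", loo_overlap)
print("global temporal overlaps in time:", gts_overlap)
```

In this toy log, the leave-one-out target for `u2` (timestamp 4) is older than a training interaction of `u3` (timestamp 7), which is the kind of train/test overlap the abstract describes. The global cutoff keeps every test interaction after every training interaction by construction.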

