A Reproducibility Analysis of PO4ISR: Diagnosing and Mitigating Semantic Drift in LLM-Based Session Recommendation
Identifies and mitigates a critical failure mode in LLM-based session recommendation for practitioners needing robust cross-domain performance.
PO4ISR suffers from semantic drift in long sessions, degrading performance on complex datasets. The enhanced PO4ISR++ restores performance with gains of 54% on Games and 96% on Bundle.
Reasoning-based Large Language Models (LLMs) like PO4ISR have set new benchmarks in session-based recommendation. However, the reproducibility of their reasoning capabilities across diverse semantic domains remains unexplored. In this work, we conduct a rigorous reproducibility study of PO4ISR to assess its generalization limits. Our analysis reveals a critical failure mode: standard reasoning prompts suffer from severe contextual drift in long sessions, leading to performance degradation on semantically complex datasets like Games and Bundle. To quantify and resolve this stability gap, we introduce PO4ISR++, a robustness-enhanced implementation that integrates reflexive prompting and consistent rank detection. Unlike the original static prompting strategy, our approach dynamically adapts to cross-domain cues. We benchmark both the original implementation and our robust variant on ML-1M, Games, and Bundle. Our results confirm that while the original model struggles in new domains, our reproducible extension restores performance, yielding a stabilized gain of up to 54% on Games and 96% on Bundle. We release open-source artifacts, including the reproduced baseline and our enhanced framework, to facilitate reliable future research in LLM-based recommendation.