IR · Apr 20

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

arXiv:2403.03952 · 88.1372 citations · h-index: 21
Predicted impact: top 9% in IR · last 90 days · Originality: Incremental advance
AI Analysis

For researchers and practitioners in recommender systems, this benchmark provides a more relevant evaluation of LLMs for recommendation tasks, revealing that general-purpose benchmarks are insufficient.

The paper introduces BLaIR, a benchmark to evaluate LLMs as semantic encoders for recommendation tasks, using a new large-scale Amazon Reviews 2023 dataset. Experiments with 11 LLMs show that their performance on BLaIR has little correlation with general-purpose embedding benchmarks like MTEB, highlighting the unique challenges of recommendation.

Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale Amazon Reviews 2023 dataset with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task featuring both semi-synthetic and real-world evaluation datasets. Experiments with 11 leading LLMs show that their rankings on BLaIR correlate only weakly with their rankings on MTEB, highlighting the unique challenges of semantic encoding in recommendation.
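
The abstract's central claim is about rank correlation between two benchmarks. The abstract does not say which correlation statistic the authors use, so the following is only a minimal sketch assuming Spearman's rank correlation. The function names (`ranks`, `spearman`) and all scores are hypothetical placeholders, not results from the paper.

```python
from typing import Sequence


def ranks(scores: Sequence[float]) -> list[float]:
    """Return 1-based ranks (1 = highest score), averaging ranks across ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend the run [pos, end] over all models tied at the same score.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all scores are tied")
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Hypothetical placeholder scores for five unnamed encoders (NOT from the paper).
    general_benchmark = [64.2, 61.5, 66.0, 58.3, 62.7]
    recsys_benchmark = [0.031, 0.044, 0.029, 0.040, 0.035]
    print(f"Spearman rho: {spearman(general_benchmark, recsys_benchmark):.3f}")
```

A value near +1 would mean the general-purpose benchmark is a good proxy for picking a recommendation encoder, while a value near 0 corresponds to the "little correlation" situation the abstract describes.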

Code Implementations: 1 repo
