IRSep 18, 2024
FLARE: Fusing Language Models and Collaborative Architectures for Recommender EnhancementLiam Hebert, Marialena Kyriakidi, Hubert Pham et al.
Recent proposals in recommender systems represent items with their textual description, using a large language model. They show better results on standard benchmarks compared to an item ID-only model, such as Bert4Rec. In this work, we revisit the often-used Bert4Rec baseline and show that with further tuning, Bert4Rec significantly outperforms previously reported numbers, and in some datasets, is competitive with state-of-the-art models. With revised baselines for item ID-only models, this paper also establishes new competitive results for architectures that combine IDs and textual descriptions. We demonstrate this with Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement). Flare is a novel hybrid sequence recommender that integrates a language model with a collaborative filtering model using a Perceiver network. Prior studies focus evaluation on datasets with limited-corpus size, but many commercially-applicable recommender systems common on the web must handle larger corpora. We evaluate Flare on a more realistic dataset with a significantly larger item vocabulary, introducing new baselines for this setting. This paper also showcases Flare's inherent ability to support critiquing, enabling users to provide feedback and refine recommendations. We leverage critiquing as an evaluation method to assess the model's language understanding and its transferability to the recommendation task.
CLMar 14, 2025
REGEN: A Dataset and Benchmarks with Natural Language Critiques and NarrativesKun Su, Krishna Sayana, Hubert Pham et al.
This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.