AI | Jun 24, 2025

RecLLM-R1: A Two-Stage Training Paradigm with Reinforcement Learning and Chain-of-Thought v1

Yu Xie, Xingkai Ren, Ying Qi, Yao Hu, Lianlei Shan

arXiv:2506.19235v1 | 7.85 citations | h-index: 5

Originality: Incremental advance

AI Analysis

The paper addresses filter bubbles and business-policy integration in recommendation systems. It is an incremental improvement that combines existing techniques, such as LLMs and reinforcement learning, in a new way.

The paper tackles filter bubbles and poor business alignment in recommendation systems by introducing RecLLM-R1, an LLM-based framework trained in two stages: SFT, followed by reinforcement learning (GRPO) with Chain-of-Thought. In evaluations on real-world data, it significantly outperformed baselines on accuracy, diversity, and novelty metrics.

Traditional recommendation systems often grapple with "filter bubbles", underutilization of external knowledge, and a disconnect between model optimization and business policy iteration. To address these limitations, this paper introduces RecLLM-R1, a novel recommendation framework leveraging Large Language Models (LLMs) and drawing inspiration from the DeepSeek R1 methodology. The framework initiates by transforming user profiles, historical interactions, and multi-faceted item attributes into LLM-interpretable natural language prompts through a carefully engineered data construction process. Subsequently, a two-stage training paradigm is employed: the initial stage involves Supervised Fine-Tuning (SFT) to imbue the LLM with fundamental recommendation capabilities. The subsequent stage utilizes Group Relative Policy Optimization (GRPO), a reinforcement learning technique, augmented with a Chain-of-Thought (CoT) mechanism. This stage guides the model through multi-step reasoning and holistic decision-making via a flexibly defined reward function, aiming to concurrently optimize recommendation accuracy, diversity, and other bespoke business objectives. Empirical evaluations on a real-world user behavior dataset from a large-scale social media platform demonstrate that RecLLM-R1 significantly surpasses existing baseline methods across a spectrum of evaluation metrics, including accuracy, diversity, and novelty. It effectively mitigates the filter bubble effect and presents a promising avenue for the integrated optimization of recommendation models and policies under intricate business goals.
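The second training stage can be made more concrete with a minimal sketch of the two ingredients the abstract names: a flexibly defined reward and GRPO's group-relative scoring. This is not the authors' implementation. The function names, the reward weights (`w_acc`, `w_div`), and the use of category coverage as a diversity measure are all assumptions made for illustration. The full GRPO objective also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev


def composite_reward(recommended, relevant, item_category, w_acc=1.0, w_div=0.5):
    """Hypothetical reward: weighted sum of hit rate and category coverage.

    The weights and both component definitions are illustrative assumptions,
    not the reward used in the paper.
    """
    if not recommended:
        return 0.0
    # Accuracy term: fraction of recommended items the user actually engaged with.
    hits = sum(1 for item in recommended if item in relevant)
    accuracy = hits / len(recommended)
    # Diversity term: distinct categories per recommended item.
    categories = {item_category[item] for item in recommended}
    diversity = len(categories) / len(recommended)
    return w_acc * accuracy + w_div * diversity


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its own group.

    GRPO normalizes rewards within a group of outputs sampled for the same
    prompt, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data (made up): three candidate recommendation lists for one prompt.
item_category = {"a": "sports", "b": "sports", "c": "music", "d": "news"}
relevant = {"a", "c"}
group = [["a", "b"], ["a", "c"], ["b", "d"]]

rewards = [composite_reward(rec, relevant, item_category) for rec in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the list that is both accurate and spread across categories receives the largest advantage, so a policy update would push the model toward it. Adding a business objective would amount to adding another weighted term to `composite_reward`.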
