LG | Aug 7, 2025

RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders

Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou

arXiv:2508.05289v1 | 17 citations | h-index: 4 | 2025 4th International Conference on Artificial Intelligence, Internet of Things and Cloud Computing Technology (AIoTC)

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of capturing implicit user preferences in conversational recommenders, though the contribution is incremental because it applies existing RLHF methods to a specific domain.

The paper tackles the problem of aligning conversational recommender systems with implicit user feedback by using RLHF fine-tuning to maximize implied user feedback, resulting in improved top-k recommendation accuracy, coherence, and user satisfaction on datasets like REDIAL and OpenDialKG.

Conversational recommender systems (CRS) based on Large Language Models (LLMs) must be continually aligned with user preferences to provide satisfying and context-relevant item recommendations. Traditional supervised fine-tuning cannot capture implicit feedback signals such as dwell time, sentiment polarity, or engagement patterns. In this paper, we present a fine-tuning approach that uses reinforcement learning from human feedback (RLHF) to maximize implied user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_\phi$ learned from weakly-labelled engagement information and maximize user-centric utility by optimizing the foundational LLM $M_\theta$ with proximal policy optimization (PPO). The architecture models conversational state transitions $s_t \to a_t \to s_{t+1}$, where the action $a_t$ is the LLM-generated item suggestion conditioned only on the preceding conversation history. Evaluation across synthetic and real-world datasets (e.g., REDIAL, OpenDialKG) demonstrates that our RLHF-fine-tuned models achieve better top-$k$ recommendation accuracy, coherence, and user satisfaction. This paper shows that implicit signal alignment can be an efficient way to achieve scalable and user-adaptive CRS design.
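
The two components named in the abstract, a reward derived from weak engagement labels and a PPO update, can be illustrated with a minimal sketch. This is not the authors' implementation: the `TurnSignals` fields, the `weak_reward` weights, and the `max_dwell` cap are hypothetical choices made for illustration, and `ppo_clipped_objective` is the standard clipped surrogate from the PPO literature rather than anything specific to this paper.

```python
import math
from dataclasses import dataclass


@dataclass
class TurnSignals:
    """Hypothetical implicit signals logged for one recommendation turn."""
    dwell_seconds: float  # time the user spent on the suggested item
    sentiment: float      # polarity of the user's reply, in [-1, 1]
    clicked: bool         # whether the user engaged with the suggestion


def weak_reward(sig: TurnSignals, max_dwell: float = 60.0) -> float:
    """Combine implicit signals into a scalar weak label in [0, 1].

    The weights (0.4, 0.3, 0.3) are illustrative assumptions,
    not values taken from the paper.
    """
    dwell = min(max(sig.dwell_seconds, 0.0), max_dwell) / max_dwell
    sentiment = (max(-1.0, min(1.0, sig.sentiment)) + 1.0) / 2.0
    click = 1.0 if sig.clicked else 0.0
    return 0.4 * dwell + 0.3 * sentiment + 0.3 * click


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a batch (a quantity to maximize).

    logp_new / logp_old are log-probabilities of the sampled actions a_t
    under the current and the data-collecting policy, respectively.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio between the two policies
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

In a full pipeline, labels such as `weak_reward` would serve as training targets for the learned reward model $R_\phi$, whose scores would then feed the advantage estimates passed to the PPO objective; that wiring is omitted here.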
