LG · Mar 8, 2025

Language Model Personalization via Reward Factorization

Idan Shenfeld, Felix Faltings, Pulkit Agrawal, Aldo Pacchiano

arXiv:2503.06358v1 · 21 citations · h-index: 15

Originality: Incremental advance

AI Analysis

The paper addresses a limitation of RLHF for personalized applications: existing approaches assume a single universal preference model. The contribution is rated incremental because it builds on existing RLHF methods.

The paper tackles the problem of personalizing large language models to individual user preferences by extending RLHF with a low-dimensional reward factorization framework, achieving a 67% win rate over default GPT-4o responses in human evaluations.

Modern large language models (LLMs) are optimized for human-aligned responses using Reinforcement Learning from Human Feedback (RLHF). However, existing RLHF approaches assume a universal preference model and fail to account for individual user preferences, limiting their effectiveness in personalized applications. We introduce a framework that extends RLHF to enable user personalization by leveraging the assumption that user preferences lie in a low-dimensional space. Instead of training a separate model per user, we represent user-specific rewards as a linear combination of base reward functions. Using only ~10 user responses, our method can infer user-specific rewards and align LLM outputs accordingly. We validate our approach through experiments with both synthetic and real users, demonstrating significant personalization achieved by our method. In human evaluations, our method achieves a 67% win rate over default GPT-4o responses.
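The idea of representing a user-specific reward as a linear combination of base reward functions, and inferring it from roughly ten responses, can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each response has already been scored by a handful of base reward functions (random placeholder numbers here), that user feedback arrives as pairwise preferences, and that those preferences follow a Bradley-Terry (logistic) model. The function name `infer_user_weights` and all parameters are hypothetical.

```python
import numpy as np

def infer_user_weights(phi_chosen, phi_rejected, steps=2000, lr=0.1, l2=1e-2):
    """Estimate user weights w from pairwise preferences.

    Assumed model (not taken from the paper): the user's reward is
    r_user(y) = w . phi(y), where phi(y) stacks the base reward scores,
    and P(chosen beats rejected) = sigmoid(w . (phi_chosen - phi_rejected)).
    """
    diff = np.asarray(phi_chosen, dtype=float) - np.asarray(phi_rejected, dtype=float)
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diff @ w))             # predicted P(chosen preferred)
        grad = diff.T @ (p - 1.0) / len(diff) + l2 * w   # gradient of the regularized negative log-likelihood
        w -= lr * grad                                   # plain gradient descent step
    return w

# Synthetic demo: 10 comparisons, 3 hypothetical base reward functions.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -0.5, 2.0])        # hidden "user" weights (made up for the demo)
phi_a = rng.normal(size=(10, 3))           # base reward scores of response A in each pair
phi_b = rng.normal(size=(10, 3))           # base reward scores of response B in each pair
a_wins = (phi_a @ true_w) > (phi_b @ true_w)
phi_chosen = np.where(a_wins[:, None], phi_a, phi_b)
phi_rejected = np.where(a_wins[:, None], phi_b, phi_a)

w_hat = infer_user_weights(phi_chosen, phi_rejected)
margins = (phi_chosen - phi_rejected) @ w_hat   # positive margin = pair ranked like the user did
print("estimated weights:", w_hat)
print("pairs ranked correctly:", int((margins > 0).sum()), "of", len(margins))
```

Under these assumptions, only the low-dimensional weight vector is fit per user, which is why a handful of comparisons can be enough; the base reward functions themselves are shared across users.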
