ML AI LG

Apr 3, 2025

Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning

Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi

arXiv:2504.03784v520.915 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of aligning large language models with human preferences more reliably, offering an incremental improvement for AI safety and performance in real-world applications.

The paper tackles reward model misspecification in reinforcement learning from human feedback (RLHF) for large language models. It proposes a robust algorithm that reduces the variance of the reward and policy estimators and improves regret bounds. Empirically, 77-81% of its responses are favored over baselines on the Anthropic Helpful and Harmless dataset.

Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
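
For readers unfamiliar with the Bradley-Terry model mentioned in the abstract, the sketch below shows its standard textbook form: the probability that one response is preferred over another is the sigmoid of the difference between their scalar rewards, and a reward model is usually fit by minimizing the negative log-likelihood of observed preferences. This is only the baseline model that the paper says can be misspecified, not the proposed robust algorithm. The function names and the toy reward values are illustrative assumptions and do not come from the paper or its repository.

```python
import math


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the 'chosen' response is preferred.

    Equals sigmoid(reward_chosen - reward_rejected). The names are
    hypothetical and used only for illustration.
    """
    gap = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-gap))


def bt_negative_log_likelihood(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs.

    Fitting a reward model under Bradley-Terry typically means minimizing
    this quantity with respect to the reward model's parameters.
    """
    total = sum(math.log(bt_preference_prob(c, r)) for c, r in pairs)
    return -total / len(pairs)


if __name__ == "__main__":
    # Toy rewards (assumed values, not data from the paper).
    toy_pairs = [(1.2, 0.3), (0.5, 0.9), (2.0, -1.0)]
    print(bt_preference_prob(1.2, 0.3))
    print(bt_negative_log_likelihood(toy_pairs))
```

If real human judgments do not follow this sigmoid-of-reward-gap form, a reward model fit this way is misspecified, which is the setting the abstract describes.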
