CL | AI | Nov 1, 2024

Self-Evolved Reward Learning for LLMs

arXiv:2411.00418v3 | 20 citations | h-index: 28
Originality: Incremental advance
AI Analysis

This paper addresses the cost and potential bias of human feedback in reinforcement learning for large language models, offering an incremental improvement in reward learning efficiency.

The paper tackles the challenge of training reliable reward models for aligning language models with human preferences. It proposes Self-Evolved Reward Learning (SER), in which the reward model generates additional training data to iteratively improve itself. The reported results show that SER robustly enhances reward model performance even with limited human-annotated data.

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conduct extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). Resources for this paper can be found at https://aka.ms/ser
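
The abstract describes the loop only at a high level: a reward model trained on a small labeled set labels further data itself and is retrained on the result. The following is a minimal sketch of that general idea, not the paper's implementation. It assumes a linear Bradley-Terry reward model over hand-made feature vectors and a simple confidence threshold for accepting self-generated labels; the function names (`prefer_prob`, `train`, `self_evolve`), the threshold, and the toy data are all hypothetical.

```python
import math


def score(w, x):
    """Linear reward: dot product of weights and response features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def prefer_prob(w, a, b):
    """Bradley-Terry probability that response a is preferred over b."""
    z = score(w, a) - score(w, b)
    z = max(-30.0, min(30.0, z))  # clamp the margin to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(w, pairs, lr=0.5, epochs=200):
    """Gradient ascent on the pairwise log-likelihood.

    Each pair is (chosen, rejected); returns a new weight list.
    """
    w = list(w)
    for _ in range(epochs):
        for chosen, rejected in pairs:
            p = prefer_prob(w, chosen, rejected)
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w


def self_evolve(w, labeled, unlabeled, rounds=3, threshold=0.9):
    """Assumed self-training loop: keep only confident self-labels, then retrain."""
    labeled = list(labeled)
    pool = list(unlabeled)
    w = train(w, labeled)
    for _ in range(rounds):
        remaining = []
        for a, b in pool:
            p = prefer_prob(w, a, b)
            if p >= threshold:
                labeled.append((a, b))      # model is confident a beats b
            elif p <= 1.0 - threshold:
                labeled.append((b, a))      # model is confident b beats a
            else:
                remaining.append((a, b))    # too uncertain; leave unlabeled
        pool = remaining
        w = train(w, labeled)
    return w, labeled, pool


# Toy data (hypothetical): two human-labeled pairs and three unlabeled pairs.
human_pairs = [((1.0, 0.2), (0.0, 0.5)), ((0.9, 0.7), (0.2, 0.1))]
unlabeled_pairs = [
    ((0.8, 0.3), (0.1, 0.4)),
    ((0.1, 0.9), (0.95, 0.2)),
    ((0.5, 0.5), (0.52, 0.5)),
]
weights, all_pairs, leftover = self_evolve([0.0, 0.0], human_pairs, unlabeled_pairs)
```

Filtering by confidence is one common way to limit the noise that self-generated labels introduce; whether and how SER filters its self-labels is not stated in the abstract, so that choice is an assumption of this sketch.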

Code Implementations: 1 repo
