LG AI Sep 11, 2024

Policy Filtration for RLHF to Mitigate Noise in Reward Models

Chuheng Zhang, Wei Shen, Li Zhao, Xuyun Zhang, Xiaolong Xu, Wanchun Dou, Jiang Bian

Tsinghua

arXiv:2409.06957v517.617 citations h-index: 21 Has Code

Originality: Incremental advance

AI Analysis

This work addresses noise in reward models for RLHF, improving the fine-tuning of LLMs on tasks such as code generation and math reasoning. It is an incremental advance because it builds on existing PPO methods.

The paper tackles the problem of inaccurate reward models in RLHF by filtering out samples whose rewards are unreliable, which improves the signal-to-noise ratio during policy learning. The resulting method, PF-PPO, achieves state-of-the-art performance among 7-billion-parameter models on code generation tasks, with gains such as +7.9% on HumanEval and +10.0% on LeetCode Contest, a new benchmark created by the authors.

While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter out the samples whose rewards may be unreliable in order to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination (R²) between the rewards and the actual scores on filtered samples as the metric for finding promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real performance. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation and math reasoning tasks. In code generation, PF-PPO achieves state-of-the-art performance among 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), a more challenging benchmark created by us. In math reasoning, PF-PPO yields performance gains across different reward models and benchmarks (Ape210K and CMATH). Code is available at https://github.com/DtYXs/verl/tree/pf-ppo.
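The R² selection criterion described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the synthetic data, the noise profile, and the helper names `r_squared`, `keep_best_and_worst`, and `demo` are all assumptions made for illustration. The filter shown is a hypothetical strategy, not necessarily one the paper evaluates. R² is computed here as the fit quality of a least-squares line from rewards to actual scores.

```python
import numpy as np


def r_squared(rewards, scores):
    """R^2 of a least-squares line fit that predicts scores from rewards."""
    rewards = np.asarray(rewards, dtype=float)
    scores = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(rewards, scores, deg=1)
    residual = scores - (slope * rewards + intercept)
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((scores - scores.mean()) ** 2))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def keep_best_and_worst(rewards, k):
    """Hypothetical filter: indices of the k lowest- and k highest-reward samples."""
    order = np.argsort(rewards)
    return np.concatenate([order[:k], order[-k:]])


def demo(seed=0, n=1000, k=100):
    """Compare R^2 on all samples with R^2 on the filtered subset.

    Assumed synthetic setup: the reward tracks the actual score closely
    when it is very low or very high, and is noisy in the middle range.
    """
    rng = np.random.default_rng(seed)
    rewards = rng.uniform(0.0, 1.0, size=n)
    noise_std = np.where((rewards < 0.2) | (rewards > 0.8), 0.02, 0.4)
    scores = rewards + rng.normal(0.0, noise_std)

    kept = keep_best_and_worst(rewards, k)
    return r_squared(rewards, scores), r_squared(rewards[kept], scores[kept])


if __name__ == "__main__":
    r2_all, r2_filtered = demo()
    print(f"R^2 on all samples:      {r2_all:.3f}")
    print(f"R^2 on filtered samples: {r2_filtered:.3f}")
```

Under these assumptions, a filtering strategy with a higher R² on its retained samples is one whose rewards better predict actual performance. This is the property the abstract uses to pick among candidate strategies.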
