Reward Difference Optimization For Sample Reweighting In Offline RLHF
This work addresses the alignment of large language models with human preferences in the more efficient offline setting. The contribution is incremental, as it builds on existing offline RLHF methods.
The paper tackles the problem that offline RLHF captures only the ordinal relationship between responses and misses the degree of preference. It proposes Reward Difference Optimization (RDO), which reweights sample pairs, and reports improved performance with 7B LLMs on the HH and TL;DR datasets in both automatic metrics and human evaluation.
With the rapid advances in Large Language Models (LLMs), aligning LLMs with human preferences has become increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF captures only the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. To address this issue, we propose a simple yet effective solution called Reward Difference Optimization, abbreviated as RDO. Specifically, we introduce reward difference coefficients to reweight sample pairs in offline RLHF. We then develop a difference model that captures rich interactions between a pair of responses to predict these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation, thereby highlighting its potential for aligning LLMs with human intent and values.
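The reweighting idea can be illustrated with a minimal sketch. The abstract does not give the exact loss or the form of the coefficients, so everything below is an assumption made for illustration: a pairwise logistic ranking loss stands in for the offline RLHF objective, and the hypothetical names `margins` (the policy's score gap between preferred and rejected responses) and `reward_diffs` (the predicted reward difference coefficients) are not taken from the paper.

```python
import math


def pairwise_logistic_loss(margin):
    """Return -log(sigmoid(margin)), computed in a numerically stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def reweighted_ranking_loss(margins, reward_diffs):
    """Average pairwise loss, with each pair scaled by its reward difference.

    Assumption: the coefficient is used as a plain multiplicative weight,
    clipped at zero so that a negative prediction cannot flip the loss sign.
    """
    if len(margins) != len(reward_diffs):
        raise ValueError("margins and reward_diffs must have the same length")
    if not margins:
        raise ValueError("at least one sample pair is required")
    total = 0.0
    for margin, diff in zip(margins, reward_diffs):
        weight = max(diff, 0.0)  # hypothetical choice of coefficient
        total += weight * pairwise_logistic_loss(margin)
    return total / len(margins)


# Two pairs with the same policy margin but different preference strength:
# the strongly preferred pair contributes three times as much to the loss.
loss = reweighted_ranking_loss(margins=[0.0, 0.0], reward_diffs=[1.0, 3.0])
```

With unit coefficients this reduces to an ordinary unweighted ranking loss, which is the purely ordinal behavior the abstract describes as the limitation of current offline RLHF.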