AI · Feb 15, 2024

Aligning Crowd Feedback via Distributional Preference Reward Modeling

Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu

arXiv:2402.09764v3 · 20.024 citations · h-index: 12

Originality: Incremental advance

AI Analysis

The paper addresses skewed reward models in reinforcement learning for LLM alignment, which fail to represent the expectations of the broader population. The contribution appears incremental, as it builds on existing reward modeling frameworks.

The paper tackles the problem of aligning large language models with diverse human preferences by proposing the Distributional Preference Reward Model (DPRM). DPRM represents multiple preferences as a categorical distribution and applies Bayesian updates to accommodate shifted or new preferences. According to the authors' experiments, this yields more accurate, unbiased, and contextually appropriate responses.
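To make the idea of a categorical preference distribution with Bayesian updates concrete, here is a minimal sketch. It assumes a Dirichlet prior over the category probabilities, which is the standard conjugate choice for a categorical distribution. The summary does not say which updater the paper actually uses, so the function names, the three preference categories, and the vote counts below are all hypothetical.

```python
from typing import List, Sequence


def bayesian_update(prior_counts: Sequence[float], new_votes: Sequence[int]) -> List[float]:
    """Dirichlet-categorical conjugate update (an assumed form of the updater).

    The posterior pseudo-counts are the prior pseudo-counts plus the
    newly observed votes for each preference category.
    """
    if len(prior_counts) != len(new_votes):
        raise ValueError("prior_counts and new_votes must have the same length")
    return [alpha + n for alpha, n in zip(prior_counts, new_votes)]


def posterior_mean(counts: Sequence[float]) -> List[float]:
    """Normalise pseudo-counts into a categorical distribution."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical example: three preference categories (dislike, neutral, like).
prior = [1.0, 1.0, 1.0]   # uniform prior, one pseudo-count per category
votes = [2, 3, 5]         # new annotator votes for one response
posterior = bayesian_update(prior, votes)
distribution = posterior_mean(posterior)
```

Because the update only adds counts, feedback from new annotators can be folded in incrementally without revisiting earlier annotations.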

Deep Reinforcement Learning is widely used for aligning Large Language Models (LLMs) with human preferences. However, conventional reward modelling depends predominantly on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in skewed models that reflect the inclinations of these annotators, thereby failing to adequately represent the wider population's expectations. We propose the Distributional Preference Reward Model (DPRM), a simple yet effective framework to align large language models with diverse human preferences. To this end, we characterize multiple preferences by a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences. On top of that, we design an optimal-transportation-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preferences, yielding more accurate, unbiased, and contextually appropriate responses.
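The optimal-transport loss and the expected reward mentioned in the abstract can be illustrated with a small sketch. It assumes the preference categories are ordered and one unit apart, in which case the optimal transport (Wasserstein-1) cost between two categorical distributions reduces to the sum of absolute differences between their cumulative distributions. The abstract does not give the paper's actual ground cost or per-category reward values, so the function names and all numbers below are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two categorical distributions.

    Assumes ordered categories with unit spacing, so the optimal transport
    cost equals the summed absolute difference of the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of categories")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # The final CDF entries are both 1, so they contribute nothing.
    return sum(abs(a - b) for a, b in zip(cdf_p[:-1], cdf_q[:-1]))


def expected_reward(p: Sequence[float], category_rewards: Sequence[float]) -> float:
    """Scalar reward: the mean of the per-category rewards under p."""
    if len(p) != len(category_rewards):
        raise ValueError("p and category_rewards must have the same length")
    return sum(prob * r for prob, r in zip(p, category_rewards))


# Hypothetical numbers: predicted vs. target preference distribution.
predicted = [0.1, 0.2, 0.7]
target = [0.2, 0.3, 0.5]
loss = wasserstein_1d(predicted, target)              # approximately 0.3
reward = expected_reward(predicted, [-1.0, 0.0, 1.0])  # approximately 0.6
```

Unlike a cross-entropy loss, a transport-based loss of this kind penalises probability mass more the further it sits from the target category, which is one plausible reason to prefer it for ordered preference levels.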
