AI
Feb 15, 2024

Aligning Crowd Feedback via Distributional Preference Reward Modeling

arXiv:2402.09764v3
24 citations
h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses skewed reward models in reinforcement learning for LLM alignment, which fail to represent the broader population's expectations. The contribution appears incremental, as it builds on existing reward modeling frameworks.

The paper tackles the problem of aligning large language models with diverse human preferences by proposing the Distributional Preference Reward Model (DPRM). DPRM characterizes multiple preferences with a categorical distribution and uses a Bayesian updater to accommodate shifted or new preferences. The authors report more accurate, unbiased, and contextually appropriate responses.

Deep Reinforcement Learning is widely used for aligning Large Language Models (LLMs) with human preferences. However, conventional reward modelling depends predominantly on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in skewed models that reflect the inclinations of these annotators, thereby failing to adequately represent the wider population's expectations. We propose the Distributional Preference Reward Model (DPRM), a simple yet effective framework to align large language models with diverse human preferences. To this end, we characterize multiple preferences by a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences. On top of that, we design an optimal-transportation-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
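The pipeline described in the abstract (categorical preference distribution, Bayesian update, optimal-transport loss, expected reward) can be illustrated with a small sketch. This is not the authors' implementation: the five preference levels, their reward values, the Dirichlet-style count update, and the use of a 1-D Wasserstein-1 distance are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import numpy as np

# Assumption: five ordered preference categories, each mapped to a scalar reward.
LEVELS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])


def bayesian_update(prior_counts, new_votes):
    """Dirichlet-multinomial style update (an assumed form of the updater).

    Adds newly observed vote counts to the prior pseudo-counts and returns
    the posterior counts together with the posterior mean distribution.
    """
    posterior = np.asarray(prior_counts, dtype=float) + np.asarray(new_votes, dtype=float)
    return posterior, posterior / posterior.sum()


def ot_loss_1d(pred, target):
    """Wasserstein-1 distance between two categorical distributions.

    Assumes the categories are ordered and one unit apart, in which case the
    optimal transport cost equals the summed absolute CDF difference.
    """
    return float(np.abs(np.cumsum(pred) - np.cumsum(target)).sum())


def expected_reward(dist, levels=LEVELS):
    """Scalar reward: expectation of the level values under the distribution."""
    return float(np.dot(dist, levels))


if __name__ == "__main__":
    prior = np.ones(5)                    # uniform pseudo-counts (assumption)
    votes = np.array([0, 1, 2, 4, 3])     # hypothetical crowd votes per category
    _, target = bayesian_update(prior, votes)

    model_pred = np.full(5, 0.2)          # stand-in for a reward model's output
    print("OT loss:", ot_loss_1d(model_pred, target))
    print("Expected reward of target:", expected_reward(target))
```

In this sketch the optimal-transport loss penalizes probability mass by how far it sits from the target category, which a plain cross-entropy loss would not do for ordered categories. The expected reward collapses the distribution into the single scalar that a policy-optimization step would consume.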

