CL · AI · LG · Nov 14, 2023

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

arXiv:2311.08045v40.2642 citations · h-index: 4 · Has Code
AI Analysis · 50

This paper addresses the efficiency of human preference optimization for LLM developers by reducing annotation costs, though it appears incremental because it builds on existing alignment methods.

The paper tackles the distribution gap problem in human preference alignment for large language models by proposing an Adversarial Preference Optimization framework, which enhances existing alignment baselines in terms of helpfulness and harmlessness without requiring additional annotation.

Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However, continuously updating LLMs for alignment creates a distribution gap between model-generated samples and human-annotated responses, hindering training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. With comprehensive experiments, we find that the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.
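
The alternating min-max update described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation or its exact objective. It assumes a single prompt with three candidate responses, a tabular reward model, a softmax policy, and a fixed "gold" response distribution standing in for the annotated data; the function name `apo_toy`, the L2 penalty `l2`, and the step size `lr` are all hypothetical choices made for illustration. The reward player lowers its scores on policy samples relative to gold samples, and the policy player raises its expected reward.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apo_toy(p_gold, steps=5000, lr=0.1, l2=0.1):
    """Toy alternating game between a tabular reward model and a softmax policy.

    Assumed objective (illustrative only):
        min_r max_pi  E_pi[r] - E_gold[r] + (l2 / 2) * ||r||^2
    """
    k = len(p_gold)
    theta = [0.0] * k   # policy logits; the policy starts uniform
    reward = [0.0] * k  # one reward score per candidate response
    for _ in range(steps):
        policy = softmax(theta)
        # Reward-model step (min player): gradient descent on the objective.
        # Only samples from the current policy and the fixed gold data are
        # used, so no new annotation enters the loop.
        reward = [
            r - lr * (p - g + l2 * r)
            for r, p, g in zip(reward, policy, p_gold)
        ]
        # Policy step (max player): exponentiated-gradient ascent on
        # E_pi[r], i.e. pi_new is proportional to pi * exp(lr * r).
        theta = [t + lr * r for t, r in zip(theta, reward)]
    return softmax(theta), reward


if __name__ == "__main__":
    gold = [0.7, 0.2, 0.1]
    policy, reward = apo_toy(gold)
    print("policy:", [round(p, 3) for p in policy])
    print("reward:", [round(r, 3) for r in reward])
```

In this toy setting the only fixed point is the policy matching the gold distribution with a flat reward, which mirrors the intuition that the reward model keeps adapting to whatever the policy currently generates. A real LLM setup would replace the tabular quantities with sampled responses and learned networks.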
