LG, AI, CL, GT | Oct 30, 2024

COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

Yixin Liu, Argyris Oikonomou, Weiqiang Zheng, Yang Cai, Arman Cohan

arXiv:2410.23223v2 | 11.55 citations | h-index: 22 | Has Code

Originality: Highly original

AI Analysis

COMAL addresses a limitation of existing alignment methods such as RLHF, which rely on a simplified preference assumption (the Bradley-Terry reward model), and so offers a more robust approach to AI safety and performance in real-world applications.

The paper tackles the problem of aligning large language models with general human preferences by proposing COMAL, a meta-algorithm whose last iterate converges to an exact Nash equilibrium policy. When applied to Llama-3-8B-Instruct and Qwen2.5-7B, it maintains win rates above 60.2% and 56.8%, respectively, against all compared algorithms under controlled evaluations.

Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or only converge to a Nash policy in a modified game, even in a simple synthetic setting, and therefore fail to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide a theoretical analysis showing that our meta-algorithm converges to an exact Nash policy in the last iterate, and we demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing preference optimization methods with minimal changes. Empirically, it consistently maintains win rates above 60.2% and 56.8%, when applied to Llama-3-8B-Instruct and Qwen2.5-7B respectively, against all compared algorithms under controlled evaluations.
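The 50% win rate guarantee can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical 3-response preference matrix `P` with cyclic preferences (which no Bradley-Terry reward can represent), and it uses a generic proximal-point-style loop (solve a KL-regularized game by self-play, then move the reference policy to the solution) as a stand-in for a last-iterate convergent method. The function names, `tau`, `lr`, and the step counts are all illustrative assumptions.

```python
import numpy as np

# Hypothetical preference matrix: P[i, j] = probability that response i is
# preferred over response j. Preferences are cyclic (0 > 1 > 2 > 0), so no
# single scalar reward per response can reproduce them.
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])

def win_rate(pi, mu, P):
    """Expected win rate of policy pi against policy mu."""
    return float(pi @ P @ mu)

def worst_case_win_rate(pi, P):
    """Win rate of pi against its best-responding opponent.
    A best response can always be taken to be a single response,
    so it suffices to take the minimum over columns."""
    return float((pi @ P).min())

def regularized_self_play(pi_ref, P, tau, lr=0.2, steps=500):
    """Approximately solve the KL-regularized game centered at pi_ref
    with mirror-descent self-play (an assumed inner solver)."""
    pi = pi_ref.copy()
    for _ in range(steps):
        logits = ((1.0 - lr * tau) * np.log(pi)
                  + lr * tau * np.log(pi_ref)
                  + lr * (P @ pi))       # payoff of each response vs. current pi
        logits -= logits.max()           # numerical stability
        pi = np.exp(logits)
        pi /= pi.sum()
    return pi

def prox_style_nash(P, tau=0.1, outer_steps=60):
    """Outer loop: repeatedly re-center the regularizer at the latest policy."""
    n = P.shape[0]
    pi_ref = np.full(n, 1.0 / n)         # start from the uniform policy
    for _ in range(outer_steps):
        pi_ref = regularized_self_play(pi_ref, P, tau)
    return pi_ref

uniform = np.full(3, 1.0 / 3.0)
pi_hat = prox_style_nash(P)
print("uniform worst-case win rate:", worst_case_win_rate(uniform, P))
print("final policy:", pi_hat)
print("final worst-case win rate:", worst_case_win_rate(pi_hat, P))
```

For this particular matrix the Nash policy can be checked by hand to be (0.3, 0.3, 0.4): every row of `P` has expected value exactly 0.5 under it, so no opponent can push its win rate below 50%. The uniform policy, by contrast, is exploitable. Re-centering the regularizer at each outer step matters because a fixed reference policy would only yield the equilibrium of the regularized (modified) game, which is the failure mode the abstract attributes to earlier methods.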
