ML AI LG

Jun 14, 2025

Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Jiancong Xiao, Zhekun Shi, Kaizhao Liu, Qi Long, Weijie J. Su

arXiv:2506.12350v114.05 citations h-index: 36

Originality: Incremental advance

AI Analysis

This paper addresses foundational inconsistencies between RLHF and social choice theory in AI alignment. It offers theoretical explanations and proposed improvements, though the advance is incremental.

The paper resolves the paradox of RLHF's empirical success despite its violations of social choice theory axioms: it shows that, under mild assumptions on the preference profile, RLHF satisfies pairwise majority and Condorcet consistency. It also proposes a modification to the reward modeling objective that improves alignment, and introduces new criteria for learning distributions over responses.

Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and Condorcet consistency. This raises a foundational question: why does RLHF perform so well in practice if it fails these seemingly essential properties? In this paper, we resolve this paradox by showing that under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency. These assumptions are frequently satisfied in real-world alignment tasks, offering a theoretical explanation for RLHF's strong practical performance. Furthermore, we show that a slight modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency even under general preference profiles, thereby improving the alignment process. Finally, we go beyond classical axioms in economic and social choice theory and introduce new alignment criteria -- preference matching, preference equivalence, and group preference matching -- that better reflect the goal of learning distributions over responses. We show that while RLHF satisfies the first two properties, it fails to satisfy the third. We conclude by discussing how future alignment methods may be designed to satisfy all three.
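
To make the axioms named in the abstract concrete, the following is a minimal Python sketch of the standard social-choice notion of a Condorcet winner, together with an informal check that a ranking agrees with every strict pairwise majority. The matrix `pref`, the function names, and all numbers are illustrative assumptions rather than anything taken from the paper, and the sketch does not implement RLHF or the paper's modified reward objective.

```python
from typing import Optional, Sequence


def condorcet_winner(pref: Sequence[Sequence[float]]) -> Optional[int]:
    """Return the index of the Condorcet winner, or None if there is none.

    pref[i][j] is the (hypothetical) fraction of annotators who prefer
    response i to response j, with pref[i][j] + pref[j][i] == 1 for i != j.
    """
    n = len(pref)
    for i in range(n):
        # A Condorcet winner beats every other response by a strict majority.
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


def respects_pairwise_majority(ranking: Sequence[int],
                               pref: Sequence[Sequence[float]]) -> bool:
    """Check that whenever a strict majority prefers a to b, a is ranked above b."""
    position = {resp: rank for rank, resp in enumerate(ranking)}
    n = len(pref)
    return all(
        position[a] < position[b]
        for a in range(n)
        for b in range(n)
        if a != b and pref[a][b] > 0.5
    )


# Illustrative profile over three responses (made-up numbers).
pref = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]
winner = condorcet_winner(pref)
consistent = respects_pairwise_majority([0, 1, 2], pref)

# A cyclic profile (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
no_winner = condorcet_winner(cyclic)
```

In the first profile, response 0 beats both alternatives head-to-head, so a Condorcet-consistent method would have to place it first. The cyclic profile shows why such axioms are stated conditionally: when majorities cycle, no Condorcet winner exists.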
