AI · LG · Feb 22, 2025

Direct Alignment with Heterogeneous Preferences

Berkeley
arXiv:2502.16320v1 · 19 citations · h-index: 20 · EAAMO
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of aligning AI with diverse user groups, but the advance is incremental because it builds on existing direct alignment methods.

The paper tackles the problem of aligning AI policies with heterogeneous human preferences, showing that using the average reward across user types is optimal but requires annotator information, and reveals a fundamental tension between consistency and sample efficiency in direct alignment.

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
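The role of the average reward can be illustrated with a toy numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper: two hypothetical user types, three candidate responses, made-up reward values, a Bradley-Terry preference model for each type, and the standard KL-regularized closed form pi(y) ∝ pi_ref(y) exp(r(y) / beta) for the policy. It computes the policy induced by the type-averaged reward and compares the pooled preference rate (labels with no annotator information) against the rate the averaged reward would imply.

```python
import numpy as np

def kl_regularized_policy(ref_probs, rewards, beta):
    """Closed-form KL-regularized optimum: pi(y) ∝ pi_ref(y) * exp(r(y) / beta)."""
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy setup (assumed values, not from the paper).
type_weights = np.array([0.5, 0.5])      # population share of each user type
rewards = np.array([[3.0, 0.0, 1.0],     # rewards of type 0 for responses 0..2
                    [-1.0, 0.0, 1.0]])   # rewards of type 1 for responses 0..2
ref_probs = np.full(3, 1.0 / 3.0)        # uniform reference policy
beta = 1.0                               # KL regularization strength

# Average reward across user types, and the single policy it induces.
avg_reward = type_weights @ rewards
policy = kl_regularized_policy(ref_probs, avg_reward, beta)

# Pooled labels: probability that response 0 is preferred to response 1 when
# each type follows its own Bradley-Terry model and annotator identity is dropped.
pooled_pref = type_weights @ sigmoid(rewards[:, 0] - rewards[:, 1])

# Preference rate implied by a single Bradley-Terry model on the average reward.
avg_pref = sigmoid(avg_reward[0] - avg_reward[1])

print("average reward:", avg_reward)
print("policy from average reward:", policy)
print("pooled preference rate:", pooled_pref)
print("rate implied by average reward:", avg_pref)
```

In this toy case the two preference rates differ, which is one way to see why pooled comparisons alone may not identify the average reward and why some annotator information can help.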

Code Implementations: 1 repo
