LG, AI, May 17, 2025

Pairwise Calibrated Rewards for Pluralistic Alignment

Harvard
arXiv:2506.06298v1, 6 citations, h-index: 6
Originality: Incremental advance
AI Analysis

The paper addresses the discounting of minority perspectives in AI alignment and offers a route toward more inclusive, context-aware systems, though its approach is incremental.

The paper tackles the problem of aligning AI systems with diverse human preferences by proposing a method to learn a distribution over multiple reward functions from pairwise preferences, achieving improved calibration to better represent pluralistic values.

Current alignment pipelines presume a single, universal notion of desirable behavior. However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preferences without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse preference distributions. Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values.
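
The pairwise calibration criterion described in the abstract can be illustrated with a small sketch. This is not the paper's training heuristic, only a way to measure how far a given weighted ensemble is from being calibrated. The names `ensemble_preference` and `calibration_error`, the use of a mean squared gap, the strict-inequality treatment of ties, and the toy length-based rewards are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a reward function maps a response string to a scalar score.
Reward = Callable[[str], float]
# Hypothetical record: (response_a, response_b, fraction of annotators preferring a).
Pair = Tuple[str, str, float]


def ensemble_preference(rewards: Sequence[Reward], weights: Sequence[float],
                        a: str, b: str) -> float:
    """Weighted share of reward functions that score `a` strictly above `b`."""
    total = sum(weights)
    return sum(w for r, w in zip(rewards, weights) if r(a) > r(b)) / total


def calibration_error(rewards: Sequence[Reward], weights: Sequence[float],
                      pairs: Sequence[Pair]) -> float:
    """Mean squared gap between the ensemble share and the annotator fraction.

    The squared gap is an assumed error measure, chosen for simplicity.
    """
    gaps = [(ensemble_preference(rewards, weights, a, b) - p) ** 2
            for a, b, p in pairs]
    return sum(gaps) / len(gaps)


# Toy data: 30% of annotators prefer the short answer over the long one.
pairs = [("short", "a much longer answer", 0.3)]

# Two assumed reward functions: one favors longer text, one favors brevity.
prefers_long = lambda s: float(len(s))
prefers_short = lambda s: -float(len(s))

# An ensemble weighted 70/30 reproduces the annotator split on this pair.
pluralistic = calibration_error([prefers_long, prefers_short], [0.7, 0.3], pairs)

# A single majority-aligned reward collapses the disagreement to 0% vs 100%.
majority_only = calibration_error([prefers_long], [1.0], pairs)
```

On this toy pair the two-member ensemble has a smaller gap than the single majority-aligned reward, which is the behavior the criterion is meant to capture: disagreement among annotators is matched by disagreement among reward functions rather than being averaged away.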

