Pareto-Optimal Learning from Preferences with Hidden Context
This work addresses the challenge of aligning AI fairly and effectively with diverse populations, and it represents an incremental improvement over existing RLHF methods.
The paper tackles the problem of aligning AI models with diverse human preferences in reinforcement learning from human feedback (RLHF). It proposes Pareto Optimal Preference Learning (POPL), which frames group preferences as objectives with potential trade-offs and seeks policies that are Pareto-optimal. The authors report that POPL surpasses baseline methods in learning reward functions and policies across several domains without needing group labels.
Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to the number of groups or to membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, and large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.
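To make the role of lexicase selection concrete, here is a minimal sketch of standard lexicase selection applied to a toy preference dataset. It is not the POPL implementation. The linear reward model, the 0/1 error per preference pair, the toy data, and every function name are assumptions made for illustration only. Each preference pair is treated as its own selection case, so no group labels are used.

```python
import random

def reward(weights, item):
    # Hypothetical linear reward model: dot product of weights and item features.
    return sum(w * x for w, x in zip(weights, item))

def preference_errors(weights, pairs):
    # One error per preference pair (preferred, rejected):
    # 0 if the preferred item scores strictly higher, otherwise 1.
    return [0 if reward(weights, a) > reward(weights, b) else 1 for a, b in pairs]

def lexicase_select(errors, rng):
    # Standard lexicase selection: visit the cases in random order and keep
    # only the candidates that are best on each case in turn.
    case_order = list(range(len(errors[0])))
    rng.shuffle(case_order)
    pool = list(range(len(errors)))
    for case in case_order:
        best = min(errors[i][case] for i in pool)
        pool = [i for i in pool if errors[i][case] == best]
        if len(pool) == 1:
            break
    return rng.choice(pool)

def dominates(e1, e2):
    # e1 Pareto-dominates e2: no worse on every case, strictly better on one.
    return all(a <= b for a, b in zip(e1, e2)) and any(a < b for a, b in zip(e1, e2))

# Toy data (assumed): two hidden groups with opposite tastes, no labels given.
pairs = [
    ((1.0, 0.0), (0.0, 1.0)),  # written by someone who favors feature 0
    ((2.0, 0.0), (0.0, 1.0)),
    ((0.0, 1.0), (1.0, 0.0)),  # written by someone who favors feature 1
    ((0.0, 2.0), (1.0, 0.0)),
]
candidates = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (-1.0, -1.0)]
errors = [preference_errors(w, pairs) for w in candidates]

rng = random.Random(0)
selected = {lexicase_select(errors, rng) for _ in range(200)}
print(sorted(selected))
```

In this toy setup, the two candidates that each satisfy one group's pairs are the only non-dominated ones, and repeated selection returns both of them rather than a single compromise. That is the sense in which lexicase selection keeps a diverse set of Pareto-optimal solutions.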