LG · Jun 5

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

arXiv:2505.10892
Originality: Highly original
AI Analysis

For LLM alignment practitioners, MOPO provides a principled method to handle multiple conflicting human preferences, improving upon single-objective approaches like DPO.

MOPO addresses multi-objective preference alignment in LLMs, enabling a policy to balance conflicting objectives such as helpfulness and harmlessness through constrained KL-regularized optimization. It recovers Pareto-optimal policies on synthetic benchmarks and, on human-preference data, yields multi-billion-parameter models that Pareto-dominate baselines.
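To make "constrained KL-regularized optimization" concrete, here is a generic textbook formulation written with point-wise rewards. It is an illustration under stated assumptions, not the paper's objective: MOPO itself works from pairwise preferences without point-wise rewards, and the symbols $r_k$ (per-objective rewards), $b_k$ (thresholds), $\beta$ (KL weight) and $\lambda_k$ (Lagrange multipliers) are notation chosen here for the sketch. Objective 1 is the primary one; objectives $2,\dots,K$ must stay above their thresholds. For fixed multipliers, the Lagrangian has the familiar exponentially tilted closed-form maximizer.

```latex
\max_{\pi}\;
\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\big[r_1(x,y)\big]
-\beta\,\mathbb{E}_{x\sim\mathcal{D}}\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\quad\text{s.t.}\quad
\mathbb{E}_{x,\,y\sim\pi}\big[r_k(x,y)\big]\ge b_k,\qquad k=2,\dots,K

% For fixed multipliers \lambda_k \ge 0 the Lagrangian maximizer is
\pi_\lambda(y\mid x)=
\frac{\pi_{\mathrm{ref}}(y\mid x)\,
\exp\!\Big(\tfrac{1}{\beta}\big[r_1(x,y)+\sum_{k=2}^{K}\lambda_k\,r_k(x,y)\big]\Big)}
{Z_\lambda(x)}

% and the multipliers can be updated by projected dual descent with step size \eta:
\lambda_k \leftarrow \max\!\Big(0,\;\lambda_k-\eta\big(\mathbb{E}_{\pi_\lambda}[r_k]-b_k\big)\Big)
```

The constant terms $-\lambda_k b_k$ drop out of $\pi_\lambda$ because they are absorbed by the normalizer $Z_\lambda(x)$; a multiplier grows only while its constraint is violated.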

Post-training LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives, such as helpfulness and harmlessness, with no natural scalarization. We study the multi-objective preference alignment problem, where a policy must balance several objectives simultaneously. We propose Multi-Objective Preference Optimization (MOPO), a constrained KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives via tunable safety thresholds. MOPO operates directly on pairwise preferences without point-wise rewards, and admits simple closed-form iterative updates. Empirically, MOPO recovers Pareto-optimal policies on synthetic benchmarks and, when fine-tuned on human-preference data, yields multi-billion-parameter models that achieve higher rewards and Pareto-dominate baselines, with stable and robust optimization dynamics.
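The abstract's idea of maximizing a primary objective under a KL penalty while keeping a secondary objective above a threshold can be illustrated with a toy example. This is a minimal sketch under loud assumptions. It uses a single prompt with a finite set of candidate responses and known point-wise rewards, and it applies the standard Lagrangian recipe: an exponentially tilted closed-form policy for a fixed multiplier, plus projected dual ascent on that multiplier. This is not MOPO's preference-based update, which the abstract says avoids point-wise rewards. All function names, rewards and hyperparameters below are hypothetical.

```python
import math
from typing import List, Tuple


def tilted_policy(pi_ref: List[float], r1: List[float], r2: List[float],
                  lam: float, beta: float) -> List[float]:
    """Closed-form maximizer of E_pi[r1 + lam * r2] - beta * KL(pi || pi_ref)
    over a finite response set: pi(y) is proportional to pi_ref(y) * exp((r1 + lam * r2) / beta)."""
    logits = [math.log(p) + (a + lam * b) / beta for p, a, b in zip(pi_ref, r1, r2)]
    m = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


def expectation(pi: List[float], r: List[float]) -> float:
    """Expected reward of a discrete policy."""
    return sum(p * v for p, v in zip(pi, r))


def constrained_kl_policy(pi_ref: List[float], r1: List[float], r2: List[float],
                          threshold: float, beta: float = 1.0, lr: float = 0.5,
                          steps: int = 2000) -> Tuple[List[float], float]:
    """Maximize E[r1] - beta * KL subject to E[r2] >= threshold via
    projected dual ascent on one Lagrange multiplier lam >= 0 (hypothetical helper)."""
    lam = 0.0
    for _ in range(steps):
        pi = tilted_policy(pi_ref, r1, r2, lam, beta)
        violation = threshold - expectation(pi, r2)  # > 0 means the constraint is violated
        lam = max(0.0, lam + lr * violation)  # raise lam while violated, never below 0
    return tilted_policy(pi_ref, r1, r2, lam, beta), lam


if __name__ == "__main__":
    # Four hypothetical responses: helpfulness (r1) trades off against harmlessness (r2).
    pi_ref = [0.25, 0.25, 0.25, 0.25]
    r1 = [1.0, 0.8, 0.3, 0.0]
    r2 = [0.0, 0.5, 0.9, 1.0]
    pi, lam = constrained_kl_policy(pi_ref, r1, r2, threshold=0.6)
    print("policy:", [round(p, 3) for p in pi], "lambda:", round(lam, 3))
    print("E[r1]:", round(expectation(pi, r1), 3), "E[r2]:", round(expectation(pi, r2), 3))
```

In this toy setting the unconstrained policy (`lam = 0`) reaches an expected harmlessness of about 0.46, so a threshold of 0.6 forces a positive multiplier and shifts mass toward safer responses at some cost in helpfulness. A threshold the unconstrained policy already meets leaves the multiplier at zero. Raising `threshold` traces out a helpfulness/harmlessness trade-off curve, which is the role the abstract assigns to tunable safety thresholds.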
