Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning
This work addresses the challenge of aligning AI behavior with multi-faceted human values and provides a principled foundation for learning aligned behaviors in high-dimensional environments, though its contribution in bridging practical alignment techniques with theoretical guarantees is incremental.
The paper tackles the problem of aligning generative agents with complex human values by developing a theoretical framework for multi-objective inverse reinforcement learning, in which human preferences are modeled as latent vector-valued reward functions. It establishes conditions for recovering Pareto-optimal rewards and derives tight sample complexity bounds for ε-approximations of the Pareto front.
As generative agents become increasingly capable, aligning their behavior with complex human values remains a fundamental challenge. Existing approaches often simplify human intent by reducing it to a scalar reward, overlooking the multi-faceted nature of human feedback. In this work, we introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. We formalize the problem of recovering a Pareto-optimal reward representation from noisy preference queries and establish conditions for identifying the underlying multi-objective structure. We derive tight sample complexity bounds for recovering $\varepsilon$-approximations of the Pareto front and introduce a regret formulation to quantify suboptimality in this multi-objective setting. Furthermore, we propose a provably convergent algorithm for policy optimization using preference-inferred reward cones. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors in high-dimensional, value-pluralistic environments.
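To make the setting concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It shows two ingredients the abstract refers to: Pareto dominance over vector-valued returns, and a noisy preference query. The abstract does not specify the noise model, so the Bradley-Terry (logistic) model over a linear scalarization used here is an assumption, as are all names (`dominates`, `pareto_front`, `preference_probability`, `noisy_preference`, `weights`, `beta`) and the toy data.

```python
import math
import random


def dominates(u, v):
    """True if u Pareto-dominates v: no worse in every objective, strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(points):
    """Return the points that no other point dominates, in their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def preference_probability(r_a, r_b, weights, beta=1.0):
    """Probability that A is preferred to B.

    Assumption: a Bradley-Terry (logistic) model applied to a linear
    scalarization of the vector-valued returns with hidden weights.
    """
    s_a = sum(w * x for w, x in zip(weights, r_a))
    s_b = sum(w * x for w, x in zip(weights, r_b))
    return 1.0 / (1.0 + math.exp(-beta * (s_a - s_b)))


def noisy_preference(r_a, r_b, weights, beta=1.0, rng=random):
    """Sample one noisy preference label: True means A is preferred to B."""
    return rng.random() < preference_probability(r_a, r_b, weights, beta)


# Hypothetical two-objective returns for five candidate policies.
returns = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.4, 0.4), (0.5, 0.2)]
front = pareto_front(returns)
```

With this toy data, `(0.4, 0.4)` and `(0.5, 0.2)` are both dominated by `(0.6, 0.6)`, so only the remaining three points lie on the Pareto front. A learner in this setting would observe only samples from something like `noisy_preference`, never the returns or weights directly.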