Learning the Preferences of a Learning Agent

Karim Abdel Sadek, Mark Bedaywi, Rhys Gould, Stuart Russell

arXiv:2605.09217

AI Analysis

The paper addresses a key limitation of inverse reinforcement learning, its assumption that the observed agent is approximately optimal, by considering agents that are still learning. This is relevant to human-AI interaction scenarios.

This paper formalizes the problem of learning the preferences of a learning agent, where the agent is initially suboptimal but improves over time. The authors provide theoretical guarantees for preference learning algorithms under no-regret or Boltzmann convergence assumptions, and identify settings where guarantees are impossible.

For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.
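To make the Boltzmann setting concrete, here is a minimal sketch of a predictor watching a learner whose policy sharpens over time. It is an illustration under strong assumptions, not the paper's algorithm: the environment is a three-armed bandit, the learner follows a softmax policy whose inverse temperature grows on a hypothetical linear schedule (`beta_schedule`), the predictor is assumed to know that schedule, and it picks the most likely reward vector from a small hand-written candidate set. All function names and numeric values are invented for the example.

```python
import math
import random


def log_prob(rewards, beta, action):
    """Log-probability of `action` under a Boltzmann (softmax) policy."""
    scores = [beta * r for r in rewards]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[action] - log_norm


def beta_schedule(t):
    # Hypothetical schedule: inverse temperature grows linearly with time,
    # so the learner starts near-uniform and approaches the greedy policy.
    return 0.05 * t


def simulate_learner(true_rewards, horizon, rng):
    """Sample the learner's actions for rounds t = 1..horizon."""
    arms = range(len(true_rewards))
    actions = []
    for t in range(1, horizon + 1):
        beta = beta_schedule(t)
        probs = [math.exp(log_prob(true_rewards, beta, a)) for a in arms]
        actions.append(rng.choices(list(arms), weights=probs)[0])
    return actions


def log_likelihood(candidate, actions):
    """Log-likelihood of the observed actions if `candidate` were the reward."""
    return sum(
        log_prob(candidate, beta_schedule(t), a)
        for t, a in enumerate(actions, start=1)
    )


def infer_reward(candidates, actions):
    """Predictor: return the candidate reward vector with highest likelihood."""
    return max(candidates, key=lambda c: log_likelihood(c, actions))


if __name__ == "__main__":
    rng = random.Random(0)
    true_rewards = [0.1, 0.5, 0.9]  # assumed ground truth for the demo
    candidates = [[0.1, 0.5, 0.9], [0.9, 0.5, 0.1], [0.5, 0.9, 0.1]]
    actions = simulate_learner(true_rewards, horizon=300, rng=rng)
    print("inferred reward:", infer_reward(candidates, actions))
```

The likelihood weighs each round by how sharp the learner's policy was at that time, so early, near-random actions carry little evidence while later actions carry more. Dropping the assumption that the schedule is known, or replacing the finite candidate set with a continuous reward space, makes the problem substantially harder than this sketch suggests.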
