LG | Aug 19, 2025

Learning from Preferences and Mixed Demonstrations in General Settings

arXiv:2508.14027v1 | 1 citation | h-index: 2
Originality: Incremental advance
AI Analysis

This paper addresses reward specification in reinforcement learning, offering AI practitioners a more efficient approach that combines multiple feedback types, though it appears to be an incremental improvement on existing methods.

The paper tackles the challenge of specifying reward functions in complex reinforcement learning tasks by introducing a flexible and scalable method for learning from human data such as preferences and demonstrations. It shows that its algorithm, LEOPARD, outperforms existing baselines by a significant margin when only limited feedback is available.

Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, reward-rational partial orderings over observations, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
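
To make the idea of learning a reward from orderings concrete, here is a minimal sketch of how preferences and ranked demonstrations could both be reduced to "better than" pairs and fitted with a single loss. It is not the paper's implementation: it swaps in a plain Bradley-Terry pairwise likelihood with a linear reward, and every name in it (`pairs_from_ranking`, `fit_reward`, the toy trajectories) is hypothetical. The objective LEOPARD actually optimises may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def feature_sum(trajectory, dim):
    """Sum the feature vectors of all observations in a trajectory."""
    return [sum(obs[i] for obs in trajectory) for i in range(dim)]


def trajectory_return(weights, trajectory):
    """Return of a trajectory under the linear reward r(o) = w . phi(o)."""
    totals = feature_sum(trajectory, len(weights))
    return sum(w * f for w, f in zip(weights, totals))


def pairs_from_ranking(ranked):
    """Turn a best-to-worst ranking into all implied (better, worse) pairs."""
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]


def fit_reward(pairs, dim, lr=0.1, steps=500):
    """Gradient descent on the mean Bradley-Terry negative log-likelihood."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for better, worse in pairs:
            diff = (trajectory_return(weights, better)
                    - trajectory_return(weights, worse))
            # Gradient of -log sigmoid(diff) w.r.t. w is
            # -sigmoid(-diff) * (phi(better) - phi(worse)).
            p_wrong = sigmoid(-diff)
            fb = feature_sum(better, dim)
            fw = feature_sum(worse, dim)
            for i in range(dim):
                grad[i] -= p_wrong * (fb[i] - fw[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Toy data (made up): two features per observation.
good_demo = [[1.0, 0.0], [1.0, 0.0]]   # positive demonstration
bad_demo = [[0.0, 1.0], [0.0, 1.0]]    # negative demonstration
agent_a = [[1.0, 0.0], [0.0, 1.0]]     # agent rollout
agent_b = [[0.0, 1.0], [0.0, 0.0]]     # agent rollout

# One pairwise preference (agent_a over agent_b) ...
pairs = [(agent_a, agent_b)]
# ... plus demonstrations placed relative to agent rollouts:
# positive demo above both rollouts, negative demo below both.
pairs += pairs_from_ranking([good_demo, agent_a, bad_demo])
pairs += [(good_demo, agent_b), (agent_b, bad_demo)]

weights = fit_reward(pairs, dim=2)
print(weights)
```

The design point is that a preference, a ranking of demonstrations, and a negative demonstration all become constraints of the same form, so mixing feedback types only means adding more pairs to the same list.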

