LG, AI · May 29

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

arXiv:2605.30903 · 96.7 · h-index: 11

AI Analysis

This work is significant for researchers and practitioners in inverse reinforcement learning who deal with real-world datasets where demonstrations are often imperfect and heterogeneous, offering a robust method to learn rewards in such challenging scenarios.

This paper addresses the problem of learning rewards from demonstrations provided by multiple imperfect demonstrators with varying suboptimality levels. It introduces a feasible-reward-set framework that encodes each demonstrator's suboptimality as a linear constraint, demonstrating that the joint feasible set shrinks monotonically with more data and providing recovery guarantees for the ground-truth optimal demonstrator's reward set. The framework is shown to be effective in tabular grid-world and LLM fine-tuning experiments.

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.
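The feasible-set idea can be illustrated with a minimal sketch in a one-state (bandit-like) problem. This is not the paper's algorithm. It assumes, purely for illustration, that a demonstrator with action distribution `pi` and declared suboptimality `eps` imposes the constraints `r[a] - pi @ r <= eps` for every action `a`, which are linear in the reward vector `r`. The abstract does not give the paper's exact constraint form, and the names `constraint_rows`, `in_feasible_set`, `feasible_fraction`, `d1`, and `d2` are hypothetical.

```python
import numpy as np

def constraint_rows(pi, n_actions):
    # Assumed encoding: row a of the matrix A represents r[a] - pi @ r,
    # so "A @ r <= eps" says no action beats the demonstrator by more than eps.
    return np.eye(n_actions) - np.tile(pi, (n_actions, 1))

def in_feasible_set(r, demonstrators, tol=1e-9):
    # A reward is jointly feasible if it satisfies every demonstrator's
    # constraints, i.e. it lies in the intersection of the per-demonstrator sets.
    r = np.asarray(r, dtype=float)
    for pi, eps in demonstrators:
        A = constraint_rows(np.asarray(pi, dtype=float), len(r))
        if np.any(A @ r > eps + tol):
            return False
    return True

def feasible_fraction(demonstrators, n_actions=3, n_samples=20000, seed=0):
    # Monte Carlo estimate of how much of the reward cube [0, 1]^n is feasible.
    rng = np.random.default_rng(seed)
    samples = rng.uniform(0.0, 1.0, size=(n_samples, n_actions))
    mask = np.ones(n_samples, dtype=bool)
    for pi, eps in demonstrators:
        A = constraint_rows(np.asarray(pi, dtype=float), n_actions)
        mask &= np.all(samples @ A.T <= eps, axis=1)
    return mask.mean()

# Two hypothetical demonstrators: (action distribution, declared suboptimality).
d1 = ([0.6, 0.3, 0.1], 0.4)
d2 = ([0.2, 0.7, 0.1], 0.3)

frac_one = feasible_fraction([d1])
frac_two = feasible_fraction([d1, d2])
print(f"feasible fraction with d1 only: {frac_one:.3f}")
print(f"feasible fraction with d1 and d2: {frac_two:.3f}")
```

Because the joint set is an intersection, adding `d2` can only remove rewards, so `frac_two` never exceeds `frac_one` on the same samples. The reward `[0.5, 0, 0]` is feasible under `d1` alone but violates the constraint from `d2`, which gives one concrete case of a new demonstrator strictly tightening the set in this toy setting.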
