LG · Mar 15, 2024

Online Policy Learning from Offline Preferences

arXiv:2403.10160v1 · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses a specific issue in reinforcement learning from human preferences. It appears incremental, as it builds on existing preference-based reinforcement learning (PbRL) methods.

The paper tackles the limited generalizability of reward functions in preference-based reinforcement learning when they are learned from offline preferences. It introduces a framework that combines offline preferences with virtual preferences so that the reward function stays aligned with the agent's behaviors. Experiments on continuous control tasks indicate that the approach is effective.

In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and virtual preferences for PbRL, which are comparisons between the agent's behaviors and the offline data. Critically, the reward function can track the agent's behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.
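To make the idea of fitting one reward function on both kinds of preference concrete, here is a minimal sketch, not the paper's implementation. It assumes a Bradley-Terry preference model (a common choice in PbRL) and a linear reward over hypothetical per-step feature vectors. The abstract does not say how virtual preferences are labelled; the sketch assumes the offline segment is treated as preferred over the agent's segment, purely for illustration. The names `segment_return`, `bradley_terry_loss`, `combined_loss`, and `virtual_weight` are all hypothetical.

```python
import math

def segment_return(weights, segment):
    """Sum of a linear reward w.x over the feature vectors in a segment."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in segment)

def bradley_terry_loss(weights, preferred, other):
    """Negative log-likelihood that `preferred` beats `other` (Bradley-Terry)."""
    diff = segment_return(weights, preferred) - segment_return(weights, other)
    # -log(sigmoid(diff)), written in a numerically stable softplus form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def combined_loss(weights, offline_prefs, virtual_prefs, virtual_weight=1.0):
    """Mean loss on offline preferences plus a weighted mean loss on virtual ones.

    Each preference is a (preferred_segment, other_segment) pair.
    ASSUMPTION: in a virtual preference the offline segment is listed first,
    i.e. it is treated as preferred over the agent's own segment.
    """
    offline = sum(bradley_terry_loss(weights, a, b) for a, b in offline_prefs)
    virtual = sum(bradley_terry_loss(weights, a, b) for a, b in virtual_prefs)
    return (offline / max(len(offline_prefs), 1)
            + virtual_weight * virtual / max(len(virtual_prefs), 1))

# Toy data: one-step segments with two hypothetical features each.
offline_good = [[1.0, 0.0]]
offline_bad = [[0.0, 1.0]]
agent_segment = [[0.5, 0.5]]

offline_prefs = [(offline_good, offline_bad)]   # human-labelled, offline
virtual_prefs = [(offline_good, agent_segment)] # offline vs. agent (assumed label)

loss_uninformed = combined_loss([0.0, 0.0], offline_prefs, virtual_prefs)
loss_aligned = combined_loss([1.0, -1.0], offline_prefs, virtual_prefs)
```

Because the virtual pairs are rebuilt from the agent's latest rollouts, minimizing this loss keeps the reward model fitted on the region of behavior the agent currently visits, which is the tracking effect the abstract describes. Under these toy inputs, a reward that agrees with both preference sets yields a lower loss than an all-zero reward.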

