LG · AI · Dec 30, 2024

LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

arXiv:2412.21001v2 · 1 citation · h-index: 16 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of high labeling costs for human feedback in offline reinforcement learning, offering an incremental improvement in sample efficiency.

The paper tackles the difficulty of acquiring enough preference labels in offline preference-based reinforcement learning. It proposes the LEASE algorithm, which uses a learned transition model to generate unlabeled preference data and an uncertainty-aware mechanism to keep only high-confidence, low-variance data. LEASE achieves performance comparable to baselines while using less preference data.

Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing rewards and the high costs of online interaction. However, since labeling preferences requires real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model, where only high-confidence and low-variance data are selected. Moreover, we provide a generalization bound for the reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has a theoretical improvement guarantee. The developed theory is based on state-action pairs, so it can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve performance comparable to the baseline with less preference data and without online interaction.
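The selection step described in the abstract (keep only high-confidence, low-variance pseudo-labels) can be illustrated with a small sketch. This is not the authors' implementation: it assumes an ensemble of reward models, a Bradley-Terry preference probability, and hypothetical threshold names (`conf_threshold`, `var_threshold`) chosen purely for illustration.

```python
import numpy as np

def select_pseudo_labels(returns_a, returns_b, conf_threshold=0.9, var_threshold=0.05):
    """Filter generated preference pairs by confidence and ensemble variance.

    returns_a, returns_b: arrays of shape (n_models, n_pairs) holding the
    predicted return of segment A and segment B under each ensemble member.
    The ensemble and both thresholds are assumptions, not taken from the paper.
    """
    returns_a = np.asarray(returns_a, dtype=float)
    returns_b = np.asarray(returns_b, dtype=float)

    # Bradley-Terry probability that segment A is preferred, per ensemble member.
    probs = 1.0 / (1.0 + np.exp(returns_b - returns_a))

    mean_prob = probs.mean(axis=0)   # ensemble-averaged preference probability
    var_prob = probs.var(axis=0)     # disagreement between ensemble members

    # Confidence is the probability assigned to the more likely outcome.
    confidence = np.maximum(mean_prob, 1.0 - mean_prob)

    keep = (confidence >= conf_threshold) & (var_prob <= var_threshold)
    labels = (mean_prob >= 0.5).astype(int)  # 1 means segment A is preferred
    return keep, labels

# Four generated pairs scored by a two-member ensemble (made-up numbers).
returns_a = [[5.0, 0.1, 3.0, -4.0],
             [5.2, -0.1, -3.0, -4.5]]
returns_b = [[0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
keep, labels = select_pseudo_labels(returns_a, returns_b)
print(keep, labels)
```

In this toy input, the second pair is dropped because the ensemble is near 50/50, and the third is dropped because the two members strongly disagree; only pairs that are both confident and consistent would be added to the reward model's training set.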

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
