LG, AI
Jun 6, 2023

Boosting Offline Reinforcement Learning with Action Preference Query

arXiv:2306.03362v1
17 citations
h-index: 37
AI Analysis

This work targets high-stakes scenarios such as healthcare and autonomous driving, where online interactions can be inaccessible or catastrophic, by providing a safer, offline alternative.

The paper tackles erroneous estimates in offline reinforcement learning by introducing an interaction-free training scheme called Offline-with-Action-Preferences (OAP), which uses action preference queries to improve policy evaluation. On the D4RL benchmark, OAP raises scores by 29% on average and by 98% on AntMaze tasks.

Training practical agents usually involves both offline and online reinforcement learning (RL) to balance the policy's performance against interaction costs. In particular, online fine-tuning has become a common way to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic in high-stakes scenarios like healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful for the erroneous-estimate problem. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attain a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy's performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark with state-of-the-art algorithms demonstrate that OAP yields higher scores (29% on average), especially on challenging AntMaze tasks (98% higher).
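
The idea of encouraging or suppressing the policy constraint according to action preferences can be illustrated with a toy sketch. This is not the paper's algorithm. It assumes a behavior-cloning-style constraint (squared distance between the learned and dataset actions) and a hypothetical per-sample preference label, and the names `adaptive_constraint_weights`, `policy_loss`, `base_weight`, and `scale` are invented for illustration.

```python
import numpy as np


def adaptive_constraint_weights(base_weight, prefers_learned, scale=0.5):
    """Return one constraint weight per sample (hypothetical rule).

    prefers_learned[i] is True when a preference query favors the
    learned action over the pre-collected one for sample i.
    """
    prefers_learned = np.asarray(prefers_learned, dtype=bool)
    relaxed = base_weight * (1.0 - scale)    # suppress the constraint
    tightened = base_weight * (1.0 + scale)  # encourage the constraint
    return np.where(prefers_learned, relaxed, tightened)


def policy_loss(q_values, policy_actions, dataset_actions, weights):
    """Assumed objective: maximize Q while penalizing deviation
    from dataset actions, with a per-sample penalty weight."""
    q_values = np.asarray(q_values, dtype=float)
    diff = np.asarray(policy_actions, dtype=float) - np.asarray(dataset_actions, dtype=float)
    bc_penalty = np.sum(diff ** 2, axis=1)
    return float(np.mean(-q_values + weights * bc_penalty))


# Two samples: the learned action is preferred for the first only.
weights = adaptive_constraint_weights(1.0, [True, False], scale=0.5)
loss = policy_loss(
    q_values=[1.0, 2.0],
    policy_actions=[[0.0, 0.0], [1.0, 1.0]],
    dataset_actions=[[1.0, 0.0], [1.0, 1.0]],
    weights=weights,
)
```

In this sketch, a sample whose learned action is preferred gets a smaller penalty for leaving the dataset action, so the policy is freer to improve there. A sample whose dataset action is preferred gets a larger penalty, which pulls the policy back toward the data where its value estimate is more likely an overestimate.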
