LG | AI | Nov 8, 2022

ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data

Tengyang Xie, Mohak Bhardwaj, Nan Jiang, Ching-An Cheng

arXiv:2211.04538v111.110 citations | h-index: 34

Originality: Incremental advance

AI Analysis

This work addresses the challenge of safely deploying offline RL in real-world systems, where avoiding performance drops is critical. Its contribution is incremental: it builds on existing concepts such as relative pessimism.

The paper tackles the problem of robust policy improvement in offline reinforcement learning by proposing ARMOR, a model-based framework that guarantees no performance degradation relative to an arbitrary baseline policy and can compete with the best policy under certain conditions.

We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of ARMOR never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned, and the baseline policy is supported by the data. Such a robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring.
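The idea of optimizing worst-case relative performance can be illustrated with a toy sketch. The code below is not ARMOR itself: it assumes a 3-armed bandit instead of an MDP, a hand-written list of candidate reward models standing in for the set of models consistent with the data, and a small finite set of candidate policies. The function name `relative_pessimism` and all numbers are hypothetical. It picks the policy that maximizes the minimum, over candidate models, of its return minus the baseline's return.

```python
import numpy as np

def relative_pessimism(policies, models, baseline):
    """Pick the policy with the best worst-case improvement over the baseline.

    policies: (P, A) array, each row an action distribution (hypothetical candidates).
    models:   (M, A) array, each row the expected reward per action under one
              candidate model (a stand-in for the data-consistent model set).
    baseline: (A,) array, the baseline policy's action distribution.
    """
    returns = policies @ models.T        # (P, M): return of each policy in each model
    base_returns = models @ baseline     # (M,):   return of the baseline in each model
    # Worst-case relative performance of each policy across the candidate models.
    worst_gap = (returns - base_returns).min(axis=1)
    best = int(np.argmax(worst_gap))
    return best, worst_gap

# Toy data (assumed): the baseline always pulls arm 0.
baseline = np.array([1.0, 0.0, 0.0])
policies = np.vstack([baseline, np.eye(3)[1], np.eye(3)[2]])
models = np.array([
    [0.5, 0.7, 0.9],   # arm 2 looks best under this model
    [0.5, 0.6, 0.1],   # arm 2 is poorly covered, so it could also be bad
])

best, worst_gap = relative_pessimism(policies, models, baseline)
print(best, worst_gap)
```

Because the baseline is itself among the candidates and its relative performance is zero under every model, the selected policy's worst-case gap cannot be negative in this sketch. That mirrors, in miniature, the no-degradation property described in the abstract: the uncertain arm 2 is rejected despite its high upside, and arm 1 is chosen because it beats the baseline under every candidate model.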
