LG AI ML | Sep 14, 2022

Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation

Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, Zhengyuan Zhou

Tsinghua

arXiv:2209.06620v320.233 citations | h-index: 37

Originality: Highly original

AI Analysis

It addresses critical barriers for applying RL to real-world problems, such as robotics or healthcare, by providing robust policies that handle distribution shifts, though it is incremental in extending robustness to linear function approximation.

This paper tackles the problem of limited data and environment mismatch in reinforcement learning by proposing distributionally robust offline RL with linear function approximation, achieving error bounds of $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ for two settings, which are the first non-asymptotic sample complexity results in this context.

Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms, one for each of the two settings, that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
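
As a rough reading of these rates, one can set each bound equal to a target error $\epsilon$ and solve for $N$. This is only a sketch: $\epsilon$ is a symbol introduced here for illustration, and the calculation ignores logarithmic factors, constants, and any dependence on quantities the abstract does not state (such as the horizon or the size of the uncertainty set).

$$
\frac{d^{1/2}}{N^{1/2}} \le \epsilon \iff N \ge \frac{d}{\epsilon^{2}},
\qquad
\frac{d^{3/2}}{N^{1/2}} \le \epsilon \iff N \ge \frac{d^{3}}{\epsilon^{2}}.
$$

Under these assumptions, the well-explored setting would need on the order of $d/\epsilon^{2}$ trajectories, while the setting with coverage of the optimal policy only would need on the order of $d^{3}/\epsilon^{2}$.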
