LG AI NE | Nov 20, 2022

Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows

Dmitriy Akimov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov, Sergey Kolesnikov

arXiv:2211.11096v2 | 16.518 citations | h-index: 12 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses offline RL, in which policies are trained from fixed datasets. It offers an incremental improvement by using Normalizing Flows to keep the learned policy conservative.

The paper tackles extrapolation error and distributional shift in offline reinforcement learning by training conservative agents in the latent space of Normalizing Flows. The resulting method outperforms recently proposed algorithms with generative action models on a large portion of the locomotion and navigation datasets evaluated.

Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism - i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model - controller in the latent space - is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
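The two-stage idea in the abstract (pre-train an invertible action encoder on the dataset, then act through a controller in its latent space) can be illustrated with a deliberately tiny sketch. The abstract does not specify the flow architecture, so everything below is an assumption made for illustration: a one-dimensional conditional affine flow stands in for the Normalizing Flows encoder, a least-squares fit stands in for supervised pre-training, and the names `fit_affine_flow`, `decode`, `encode`, `bounded_latent`, and the bound `z_max` are hypothetical. No reinforcement learning is performed here; the "controller" is just a raw number squashed into a bounded latent.

```python
import math
import random


def fit_affine_flow(states, actions):
    """Maximum-likelihood fit of the toy flow a = w*s + b + sigma*z, z ~ N(0, 1).

    For this affine model the fit reduces to least squares for (w, b)
    and the residual standard deviation for sigma.
    """
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    var_s = sum((s - mean_s) ** 2 for s in states) / n
    cov_sa = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions)) / n
    w = cov_sa / var_s
    b = mean_a - w * mean_s
    resid_var = sum((a - (w * s + b)) ** 2 for s, a in zip(states, actions)) / n
    return w, b, math.sqrt(resid_var)


def decode(z, s, params):
    """Latent -> action: the generative direction of the toy flow."""
    w, b, sigma = params
    return w * s + b + sigma * z


def encode(a, s, params):
    """Action -> latent: the exact inverse of decode."""
    w, b, sigma = params
    return (a - (w * s + b)) / sigma


def bounded_latent(u, z_max=2.0):
    """Squash an unbounded controller output into [-z_max, z_max]."""
    return z_max * math.tanh(u)


# Hypothetical behavior dataset: a = 0.5*s plus Gaussian noise of scale 0.1.
rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
actions = [0.5 * s + 0.1 * rng.gauss(0.0, 1.0) for s in states]

# Stage 1 (stand-in for supervised pre-training of the action encoder).
params = fit_affine_flow(states, actions)

# Stage 2 (stand-in for the latent controller): even an extreme raw output
# decodes to an action within z_max noise scales of the behavior mean.
s = 0.3
wild_action = decode(bounded_latent(100.0), s, params)
```

In this toy setting the decoded action can never leave the band that the fitted model associates with the dataset, no matter what the controller outputs, which mirrors the abstract's point that the approach avoids querying actions outside the training data. A real implementation would replace the affine map with a learned conditional flow and train the controller with an RL objective.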

