LG, AI | Oct 18, 2022

Rethinking Value Function Learning for Generalization in Reinforcement Learning

Seungyong Moon, JunYeong Lee, Hyun Oh Song

arXiv:2210.09960v28.717 citations | h-index: 22 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses observational generalization in reinforcement learning for agents trained on visually diverse environments. It is an incremental advance whose main contribution is a way of regularizing the value network.

The paper tackles the challenge of training reinforcement learning agents that generalize across visually diverse environments. It argues that value networks in this setting are harder to optimize and tend to memorize the training data. It proposes Delayed-Critic Policy Gradient (DCPG), which optimizes the value network less frequently but with more training data than the policy network, together with a self-supervised task that learns forward and inverse dynamics using a single discriminator. The authors report significant improvements in observational generalization performance and sample efficiency on the Procgen Benchmark.

Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is more challenging to optimize and prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization on the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently with more training data than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
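As a rough illustration of the update schedule the abstract describes (the value network optimized less frequently, and on more data, than the policy network), the Python sketch below tracks only how many samples each phase would see. The function name `delayed_critic_schedule`, the parameters `value_every` and `rollout_size`, and the choice to pool every rollout collected since the last value update are assumptions made for illustration; the paper's actual losses, hyperparameters, and network architecture are not reproduced here.

```python
def delayed_critic_schedule(num_iters, value_every, rollout_size):
    """Return a list of (iteration, policy_samples, value_samples) tuples.

    Assumed schedule (hypothetical, not taken from the paper): the policy
    is updated every iteration on the newest rollout only, while the value
    function is updated once every `value_every` iterations on all rollouts
    collected since its previous update.
    """
    buffer = []  # rollouts waiting for the next value update
    log = []
    for it in range(1, num_iters + 1):
        # Placeholder rollout: a real agent would store transitions here.
        rollout = [(it, step) for step in range(rollout_size)]
        buffer.append(rollout)

        # Policy phase: runs every iteration, fresh rollout only.
        policy_samples = len(rollout)

        # Value phase: delayed, but sees every buffered rollout.
        value_samples = 0
        if it % value_every == 0:
            value_samples = sum(len(r) for r in buffer)
            buffer.clear()

        log.append((it, policy_samples, value_samples))
    return log


if __name__ == "__main__":
    for row in delayed_critic_schedule(num_iters=6, value_every=3, rollout_size=4):
        print(row)
```

Under these assumptions, the value phase runs one third as often as the policy phase but each of its updates draws on three rollouts instead of one, which is the "less frequently with more training data" trade-off in miniature.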
