LG | Jan 18, 2023

DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training

Philipp Altmann, Thomy Phan, Fabian Ritz, Thomas Gabor, Claudia Linnhoff-Popien

arXiv:2301.07421v12.01 citations | h-index: 27

Originality: Incremental advance

AI Analysis

This paper addresses a challenge in reinforcement learning: training agents in environments where feedback is limited or changes over time. The contribution appears incremental, as it builds upon self-imitation learning.

The paper tackles the problem of learning from sparse and shifting rewards in reinforcement learning by proposing DIRECT, an extension that uses discriminative reward co-training to provide surrogate rewards. The authors report that DIRECT outperforms state-of-the-art algorithms in such environments.

We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer that stores beneficial trajectories generated by the policy, selected by their return. A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape and thus learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments, as it is able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.
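
The two ingredients described in the abstract, a return-ranked imitation buffer and a reward derived from the discriminator's verdict, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `ImitationBuffer`, the function `surrogate_reward`, the fixed `capacity`, and the GAIL-style mapping `-log(1 - D)` are all assumptions made for illustration, and the discriminator itself is left out (only its output probability is used).

```python
import heapq
import itertools
import math
from typing import List, Sequence, Tuple

# Hypothetical encoding: a trajectory is a list of (state, action) pairs.
Transition = Tuple[Sequence[float], int]
Trajectory = List[Transition]


class ImitationBuffer:
    """Keeps the `capacity` highest-return trajectories seen so far (assumed design)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Min-heap of (return, tie_breaker, trajectory); the worst entry sits at index 0.
        self._heap: list = []
        # The counter breaks ties so trajectories themselves are never compared.
        self._counter = itertools.count()

    def add(self, trajectory: Trajectory, episode_return: float) -> bool:
        """Store the trajectory if there is room or it beats the worst stored return."""
        entry = (episode_return, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if episode_return > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def returns(self) -> List[float]:
        """Stored returns, best first."""
        return sorted((r for r, _, _ in self._heap), reverse=True)


def surrogate_reward(prob_beneficial: float, eps: float = 1e-8) -> float:
    """Map a discriminator output in [0, 1] to a reward.

    Assumption: a GAIL-style -log(1 - D) mapping; the source does not state
    which mapping DIRECT uses.
    """
    return -math.log(max(1.0 - prob_beneficial, eps))


buffer = ImitationBuffer(capacity=2)
buffer.add([([0.0], 0)], episode_return=1.0)
buffer.add([([0.1], 1)], episode_return=3.0)
buffer.add([([0.2], 0)], episode_return=0.5)  # rejected: worse than both stored returns
buffer.add([([0.3], 1)], episode_return=2.0)  # replaces the trajectory with return 1.0
print(buffer.returns())
```

In a full training loop, the discriminator would be trained to separate buffer samples from fresh policy rollouts, and `surrogate_reward` would be applied to its output on each policy transition. Under this mapping, transitions that look more like the stored beneficial trajectories receive higher reward.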

