LGJun 22, 2021

Off-Policy Reinforcement Learning with Delayed Rewards

arXiv:2106.11854v149 citations
Originality Incremental advance
AI Analysis

This addresses a challenge in real-world RL tasks where rewards are not immediate, but it is incremental as it builds on existing off-policy methods.

The paper tackles the problem of reinforcement learning with delayed rewards by introducing a new off-policy framework with a Q-function formulation that handles non-Markovian environments, achieving superior performance over existing methods in experiments.

We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes