Reward Redistribution via Gaussian Process Likelihood Estimation
This work addresses reward sparsity in reinforcement learning for tasks such as robotics. The contribution is incremental, since it builds on existing reward redistribution methods.
The paper tackles the problem of sparse and delayed rewards in reinforcement learning by proposing a Gaussian process based Likelihood Reward Redistribution (GP-LRR) framework, which models reward dependencies and yields superior sample efficiency and policy performance on MuJoCo benchmarks.
In many practical reinforcement learning tasks, feedback is only provided at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a Gaussian process based Likelihood Reward Redistribution (GP-LRR) framework that addresses this issue by modeling the reward function as a sample from a Gaussian process, which explicitly captures dependencies between state-action pairs through the kernel function. By maximizing the likelihood of the observed episodic return via a leave-one-out strategy that leverages the entire trajectory, our framework inherently introduces uncertainty regularization. Moreover, we show that conventional mean-squared-error (MSE) based reward redistribution arises as a special case of our GP-LRR framework when using a degenerate kernel without observation noise. When integrated with an off-policy algorithm such as Soft Actor-Critic, GP-LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on several MuJoCo benchmarks.
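To make the likelihood idea concrete, here is a minimal sketch of how a Gaussian process prior over per-step rewards induces a Gaussian likelihood for the episodic return. It is not the paper's leave-one-out objective. It only illustrates the standard fact that a sum of jointly Gaussian per-step rewards is itself Gaussian, with mean equal to the sum of the per-step means and variance equal to the sum of all kernel entries plus observation noise. The RBF kernel, the function names, and the parameters `length_scale` and `noise_var` are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors (an assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def episodic_return_nll(features, mean_rewards, episodic_return,
                        noise_var=1e-2, length_scale=1.0):
    """Negative log-likelihood of an observed episodic return.

    features:        list of state-action feature vectors, one per time step
    mean_rewards:    predicted mean reward for each step (e.g. from a reward model)
    episodic_return: the single scalar return observed at the end of the episode
    """
    # Mean of the return: sum of the per-step mean rewards.
    pred_mean = sum(mean_rewards)
    # Variance of the return: sum over all kernel entries, plus observation noise.
    pred_var = noise_var + sum(
        rbf_kernel(xi, xj, length_scale) for xi in features for xj in features
    )
    residual = episodic_return - pred_mean
    # Gaussian NLL: squared-error term scaled by the variance, plus a log-variance term.
    return 0.5 * (residual ** 2 / pred_var + math.log(2.0 * math.pi * pred_var))


if __name__ == "__main__":
    # Hypothetical three-step trajectory with 2-D state-action features.
    feats = [[0.0, 0.1], [0.2, 0.1], [0.9, 0.8]]
    means = [0.3, 0.3, 0.4]
    print(episodic_return_nll(feats, means, episodic_return=1.2))
```

In this sketch, the log-variance term penalizes large predictive variance, which gives one intuition for the uncertainty regularization mentioned above. If the variance is held constant, the objective becomes a scaled squared error on the return plus a constant, which is the flavor of the MSE connection. The paper's own argument, based on a degenerate kernel without observation noise, is not reproduced here.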