Value-aware Importance Weighting for Off-policy Reinforcement Learning
This work addresses stability issues in off-policy RL, a setting that matters for applications such as robotics and recommendation systems. The contribution is incremental, however, because it builds on existing importance sampling methods.
The paper tackles the high variance of importance sampling in off-policy reinforcement learning. It proposes value-aware importance weights, which reduce variance while keeping estimates unbiased, and extends several prediction algorithms with these weights to show empirical improvements.
Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of $\textit{value-aware importance weights}$ which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically.
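To make the re-weighting idea and its variance problem concrete, the following is a minimal sketch of ordinary importance sampling only. It does not implement the value-aware weights proposed in the paper. The two-action distributions, the values, and all function names are hypothetical choices made for illustration. Exact rational arithmetic is used so that unbiasedness can be checked without sampling noise.

```python
from fractions import Fraction


def importance_weights(target, behavior):
    """Ordinary importance sampling ratios rho(a) = pi(a) / b(a)."""
    return [p / q for p, q in zip(target, behavior)]


def expectation(probs, values):
    """Exact expectation of `values` under the discrete distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))


def variance(probs, values):
    """Exact variance of `values` under the discrete distribution `probs`."""
    mean = expectation(probs, values)
    return expectation(probs, [v * v for v in values]) - mean * mean


# Hypothetical two-action problem (illustrative assumption, not from the paper).
behavior = [Fraction(9, 10), Fraction(1, 10)]  # b(a): distribution that generated the data
target = [Fraction(1, 10), Fraction(9, 10)]    # pi(a): distribution we want estimates under
values = [Fraction(1), Fraction(2)]            # quantity observed for each action

rho = importance_weights(target, behavior)
weighted = [r * v for r, v in zip(rho, values)]

# Unbiasedness: E_b[rho * X] equals E_pi[X].
target_mean = expectation(target, values)
is_mean = expectation(behavior, weighted)

# Variance of one re-weighted sample under b, versus one on-policy sample under pi.
is_variance = variance(behavior, weighted)
on_policy_variance = variance(target, values)

print("target mean:", target_mean, "| importance-sampled mean:", is_mean)
print("on-policy variance:", float(on_policy_variance))
print("importance-sampled variance:", float(is_variance))
```

In this sketch the re-weighted estimate matches the target mean exactly, but a single re-weighted sample has far higher variance than an on-policy sample, because the rarely taken action carries a weight of 9. This is the kind of behaviour that motivates looking for lower-variance weights that remain unbiased.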