Revisiting Prioritized Experience Replay: A Value Perspective
This work gives RL researchers a theoretical foundation for understanding why prioritized experience replay is effective, which may lead to more principled priority mechanisms.
This paper investigates the relationship between the temporal-difference (TD) error and the value of experience in reinforcement learning. It shows theoretically that the value of experience is upper-bounded by the magnitude of the TD error for Q-learning, derives lower and upper bounds for soft Q-learning, and shows empirically that using the upper bound as the priority metric improves maximum-entropy RL performance in Atari games.
Experience replay enables off-policy reinforcement learning (RL) agents to utilize past experiences to maximize the cumulative reward. Prioritized experience replay, which weighs experiences by the magnitude of their temporal-difference error ($|\text{TD}|$), significantly improves learning efficiency. However, how $|\text{TD}|$ relates to the importance of an experience is not well understood. We address this problem from an economic perspective by linking $|\text{TD}|$ to the value of experience, which is defined as the value added to the cumulative reward by accessing the experience. We theoretically show that the value metrics of experience are upper-bounded by $|\text{TD}|$ for Q-learning. Furthermore, we extend our theoretical framework to maximum-entropy RL by deriving lower and upper bounds on these value metrics for soft Q-learning, which turn out to be the product of $|\text{TD}|$ and the "on-policyness" of the experiences. Our framework thus links two important quantities in RL: $|\text{TD}|$ and the value of experience. We empirically show that the bounds hold in practice, and that experience replay using the upper bound as the priority improves maximum-entropy RL in Atari games.
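To make the two priority schemes concrete, the following is a minimal tabular sketch and not the authors' implementation. It computes $|\text{TD}|$ for a Q-learning update and samples transitions in proportion to it, then forms a second priority as the product of $|\text{TD}|$ and an on-policyness term. The abstract does not define on-policyness, so the sketch assumes it is the current policy's probability of the stored action under a Boltzmann policy; it also uses the hard-max target for both schemes, whereas soft Q-learning would use a soft value. All names (`abs_td_error`, `boltzmann_policy`, `to_sampling_probs`), the toy Q-table, and the three-transition buffer are hypothetical.

```python
import numpy as np

def abs_td_error(q, s, a, r, s_next, done, gamma=0.99):
    """|TD| for a tabular Q-learning update: |r + gamma * max_a' Q(s', a') - Q(s, a)|."""
    target = r if done else r + gamma * np.max(q[s_next])
    return abs(target - q[s, a])

def boltzmann_policy(q_row, temperature=1.0):
    """Softmax over one row of Q-values (assumed stand-in for a maximum-entropy policy)."""
    z = (q_row - np.max(q_row)) / temperature  # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def to_sampling_probs(priorities, eps=1e-6):
    """Normalize priorities into a distribution; eps keeps zero-priority items reachable."""
    p = np.asarray(priorities, dtype=float) + eps
    return p / p.sum()

# Toy Q-table (3 states x 2 actions) and buffer of (s, a, r, s_next, done) tuples.
q = np.array([[0.0, 0.5], [1.0, 0.0], [0.0, 0.0]])
buffer = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 2, True), (0, 1, 0.0, 2, False)]

# Scheme 1: priority = |TD|.
td = [abs_td_error(q, s, a, r, s2, d) for (s, a, r, s2, d) in buffer]
td_probs = to_sampling_probs(td)

# Scheme 2 (assumption): priority = |TD| * pi(a | s).
weighted = [boltzmann_policy(q[s])[a] * e for (s, a, _, _, _), e in zip(buffer, td)]
weighted_probs = to_sampling_probs(weighted)

# Draw a replay batch of indices under the |TD| scheme.
rng = np.random.default_rng(0)
batch = rng.choice(len(buffer), size=2, p=td_probs)
```

Because a policy probability is at most 1, the second priority can never exceed $|\text{TD}|$ for any transition; it down-weights transitions whose stored action the current policy would rarely take.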