Gap-Increasing Policy Evaluation for Efficient and Noise-Tolerant Reinforcement Learning
This work addresses noise sensitivity and inefficiency in RL policy evaluation, a problem that matters for real-world applications. However, it appears incremental, since it builds on existing ideas such as Retrace and advantage learning.
The paper tackles noise sensitivity and inefficiency in reinforcement learning policy evaluation by introducing GRAPE, an algorithm that combines gap-increasing operators for noise-tolerance with off-policy eligibility traces for efficiency. According to the authors' analysis and control experiments, GRAPE learns significantly more efficiently than a simple learning-rate-based approach while keeping the same level of noise-tolerance.
In real-world applications of reinforcement learning (RL), noise from the inherent stochasticity of environments is inevitable. However, current policy evaluation algorithms, which play a key role in many RL algorithms, are either prone to noise or inefficient. To solve this issue, we introduce a novel policy evaluation algorithm, which we call Gap-increasing RetrAce Policy Evaluation (GRAPE). It leverages two recent ideas: (1) gap-increasing value update operators from advantage learning for noise-tolerance and (2) off-policy eligibility traces from the Retrace algorithm for efficient learning. We provide a detailed theoretical analysis of the new algorithm that shows its efficiency and noise-tolerance, inherited from Retrace and advantage learning. Furthermore, our analysis shows that GRAPE learns significantly more efficiently than a simple learning-rate-based approach while keeping the same level of noise-tolerance. We applied GRAPE to control problems and obtained experimental results supporting our theoretical analysis.
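The two ingredients named above can be illustrated with a minimal tabular sketch. It is not the GRAPE update itself, which the abstract does not specify. It assumes the standard Retrace trace coefficient, lambda * min(1, pi/mu), and an advantage-learning-style target in which the state value is taken as an expectation under the evaluated policy. The function names, argument names, and default values are hypothetical and chosen only for illustration.

```python
def retrace_coefficient(pi_prob, mu_prob, lam=1.0):
    """Truncated importance weight used by Retrace: lam * min(1, pi / mu).

    pi_prob: probability of the taken action under the target policy.
    mu_prob: probability of the same action under the behavior policy.
    """
    return lam * min(1.0, pi_prob / mu_prob)


def expected_value(q_row, policy_row):
    """State value as the policy-weighted average of action values."""
    return sum(p * q for p, q in zip(policy_row, q_row))


def gap_increasing_target(q_row, pi_row, action, reward,
                          next_q_row, next_pi_row,
                          gamma=0.99, alpha=0.5):
    """One-step target with an advantage-learning-style gap term (assumed form).

    The usual policy-evaluation target r + gamma * V(s') is shifted by
    alpha * (Q(s, a) - V(s)), which widens the difference between the
    targets of actions that currently look better or worse than average.
    """
    standard = reward + gamma * expected_value(next_q_row, next_pi_row)
    advantage = q_row[action] - expected_value(q_row, pi_row)
    return standard + alpha * advantage


if __name__ == "__main__":
    q_s, pi_s = [1.0, 3.0], [0.5, 0.5]
    q_next, pi_next = [2.0, 4.0], [0.5, 0.5]
    for a in (0, 1):
        target = gap_increasing_target(q_s, pi_s, a, 1.0, q_next, pi_next,
                                       gamma=0.5, alpha=0.5)
        print("action", a, "target", target)
    print("trace coefficient", retrace_coefficient(0.2, 0.8, lam=1.0))
```

With alpha = 0 the target reduces to the ordinary one-step policy-evaluation target, so both actions in the usage example would receive the same value. With alpha > 0 their targets separate, which is the gap-increasing effect that the abstract credits for noise-tolerance. The trace coefficient is what would let a multi-step off-policy return reuse such targets efficiently.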