LG | AI | Jul 19, 2021

Multimodal Reward Shaping for Efficient Exploration in Reinforcement Learning

arXiv:2107.08888v3 | 7 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of sustaining exploration in reinforcement learning agents, though the contribution appears incremental because it builds on existing reward shaping methods.

The paper tackles the problem of vanishing intrinsic rewards in reinforcement learning exploration. It introduces Jain's fairness index (JFI) as a metric to replace entropy regularizers and combines it with a VAE model that captures state novelty. The authors report higher performance than benchmark schemes in simulations.

Maintaining the long-term exploration capability of the agent remains one of the critical challenges in deep reinforcement learning. A representative solution is to leverage reward shaping to provide intrinsic rewards that encourage the agent to explore. However, most existing methods suffer from vanishing intrinsic rewards and therefore cannot provide sustainable exploration incentives. Moreover, they rely heavily on complex models and additional memory to record learning procedures, resulting in high computational complexity and low robustness. To tackle this problem, entropy-based methods have been proposed to evaluate the global exploration performance, encouraging the agent to visit the state space more equitably. However, the sample complexity of estimating the state visitation entropy is prohibitive in environments with high-dimensional observations. In this paper, we introduce a novel metric, Jain's fairness index (JFI), to replace the entropy regularizer, which approaches the exploration problem from a new perspective. In sharp contrast to the entropy regularizer, JFI is more computable and robust and can be easily generalized to arbitrary tasks. Furthermore, we leverage a variational auto-encoder (VAE) model to capture the life-long novelty of states, which is combined with the global JFI score to form multimodal intrinsic rewards. Finally, extensive simulation results demonstrate that our multimodal reward shaping (MMRS) method can achieve higher performance than other benchmark schemes.
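
To make the JFI idea concrete, here is a minimal sketch of the standard Jain's fairness index, J(x) = (sum of x_i)^2 / (n * sum of x_i^2), applied to state visitation counts. It is an illustration, not the paper's implementation: the function name `jain_fairness_index` and the use of per-state visit counts over a small discrete state set are assumptions made here, and the abstract does not say how high-dimensional observations are mapped to countable states or how the score is combined with the VAE novelty term.

```python
def jain_fairness_index(counts):
    """Standard Jain's fairness index over non-negative values.

    Returns a value in [1/n, 1]: 1 when all entries are equal,
    1/n when a single entry holds all of the mass.
    `counts` is assumed here to hold per-state visitation counts.
    """
    values = [float(c) for c in counts]
    n = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    # Hypothetical convention: with no visits yet, report 0.0 rather than divide by zero.
    if n == 0 or sum_of_squares == 0.0:
        return 0.0
    return (total * total) / (n * sum_of_squares)


# Evenly visited states versus one state visited exclusively.
uniform_score = jain_fairness_index([5, 5, 5, 5])
skewed_score = jain_fairness_index([10, 0, 0, 0])
```

Under these assumptions, a higher score means the agent has spread its visits more evenly, which is the sense in which the abstract describes visiting the state space "more equitably". Unlike an entropy estimate, the index needs only sums and sums of squares of the counts.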

