VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts
VDSC addresses an under-researched aspect of exploration in RL, the question of when to explore, and offers a more efficient alternative to blind switching mechanisms like ε-greedy.
The paper tackles the problem of when to explore in deep reinforcement learning by proposing VDSC, a method that uses the agent's internal state to decide exploration timing, and shows that it outperforms both traditional methods (ε-greedy, Boltzmann) and more sophisticated ones (Noisy Nets) on the Atari suite.
Despite the considerable attention given to the questions of \textit{how much} and \textit{how to} explore in deep reinforcement learning, the question of \textit{when} to explore remains comparatively under-researched. While more sophisticated exploration strategies can excel in specific, often sparse-reward environments, simpler approaches such as $\epsilon$-greedy continue to outperform them across a broader spectrum of domains. The appeal of these simpler strategies lies in their ease of implementation and their generality. The downside is that they are essentially blind switching mechanisms that completely disregard the agent's internal state. In this paper, we propose to leverage the agent's internal state to decide \textit{when} to explore, addressing the shortcomings of blind switching mechanisms. We present Value Discrepancy and State Counts through homeostasis (VDSC), a novel approach for efficient exploration timing. Experimental results on the Atari suite demonstrate the superiority of our strategy over traditional methods such as $\epsilon$-greedy and Boltzmann, as well as more sophisticated techniques like Noisy Nets.
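
To make the notion of a "blind switching mechanism" concrete, the following is a minimal Python sketch of the two traditional baselines named above, in their standard textbook form. The function names, the `rng` argument, and the example Q-values are illustrative assumptions and do not come from the paper. VDSC itself is not sketched, since the abstract does not specify how value discrepancy and state counts are combined.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Textbook epsilon-greedy: explore with fixed probability epsilon.

    The decision to explore depends only on a coin flip, never on the
    agent's internal state, which is why it is a "blind" switch.
    """
    if rng.random() < epsilon:
        # Explore: pick an action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann(q_values, temperature, rng):
    """Textbook Boltzmann (softmax) exploration over action values."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]


# Hypothetical Q-values for a three-action problem (illustration only).
rng = random.Random(0)
q = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)
soft_action = boltzmann(q, temperature=1.0, rng=rng)
```

In both functions the amount of exploration is governed by a single scalar (`epsilon` or `temperature`) that is set from outside. A method that decides when to explore from the agent's internal state would, under this reading, replace the coin flip in `epsilon_greedy` with a state-dependent trigger.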