Depth and nonlinearity induce implicit exploration for RL
This work addresses the challenge of exploration in RL by offering practitioners a deterministic alternative to stochastic exploration methods, though it appears incremental because it builds on existing Q-learning frameworks.
The paper tackles exploration in reinforcement learning by showing that Q-learning with a nonlinear Q-function and a purely greedy policy can match or exceed the performance of ε-greedy exploration on standard benchmarks such as mountain car, with reported improvements in learning efficiency.
How to explore, i.e., how to take actions with uncertain outcomes in order to learn about possible future rewards, is a key question in reinforcement learning (RL). Here, we show a surprising result: Q-learning with a nonlinear Q-function and no explicit exploration (i.e., a purely greedy policy) can learn several standard benchmark tasks, including mountain car, as well as or better than the most commonly used $ε$-greedy exploration. We carefully examine this result and show that both the depth of the Q-network and the type of nonlinearity are important for inducing such deterministic exploration.
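To make the comparison concrete, the following is a minimal sketch of the two action-selection rules and the standard one-step Q-learning target. The function names `select_action` and `td_target`, the tie-breaking rule, and the default discount `gamma=0.99` are illustrative assumptions, not details taken from the paper. In the setting described above, `q_values` would be the outputs of a nonlinear Q-network, which is not modeled here.

```python
import random


def select_action(q_values, epsilon=0.0, rng=random):
    """Pick an action index from a list of Q-values.

    epsilon=0.0 gives the purely greedy policy (no explicit exploration);
    epsilon>0.0 gives epsilon-greedy. Names and defaults are hypothetical.
    """
    if epsilon > 0.0 and rng.random() < epsilon:
        # Explicit exploration: uniform random action.
        return rng.randrange(len(q_values))
    # Greedy choice; ties go to the lowest index (an assumption of this sketch).
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


if __name__ == "__main__":
    q = [0.1, 0.7, 0.3]
    print("greedy action:", select_action(q))
    print("epsilon-greedy action:", select_action(q, epsilon=0.1))
    print("TD target:", td_target(1.0, [0.0, 2.0], done=False, gamma=0.5))
```

With `epsilon=0.0` the behavior policy is a deterministic function of the current Q-values, so any change in which states are visited can only come from updates to the Q-function itself. That is the regime the abstract refers to as deterministic exploration.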