LG AI Oct 24, 2022

MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling

Julius Ott, Lorenzo Servadei, Jose Arjona-Medina, Enrico Rinaldi, Gianfranco Mauro, Daniela Sánchez Lopera, Michael Stephan, Thomas Stadelmayer, Avik Santra, Robert Wille

arXiv:2210.13545v2 1.8 h-index: 49 Has Code

Originality: Incremental advance

AI Analysis

This work targets a specific bottleneck for reinforcement learning researchers and practitioners: the sampling efficiency of experience replay buffers. The contribution is incremental, since it builds on existing sampling strategies.

The paper tackles data selection in reinforcement learning by proposing a buffer sampling strategy that uses the uncertainty of the Q-value estimate to adapt the exploration and exploitation of transitions. On dense-reward tasks, it reports an average 26% improvement in convergence and peak performance over state-of-the-art sampling methods.

Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation. Consequently, they cannot adapt the sampling strategies, including exploration and exploitation of transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. This is enabled by the uncertainty estimation of the Q-Value function, which guides the sampling to explore more significant transitions and, thus, learn a more efficient policy. Experiments on classical control environments demonstrate stable results across various environments. They show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards w.r.t. convergence and peak performance by 26% on average.
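The idea of letting Q-value uncertainty steer replay sampling can be illustrated with a small sketch. This is not the paper's priority formula, which the abstract does not give. It assumes, purely for illustration, that uncertainty is measured as the standard deviation across an ensemble of Q-value estimates, and that the sampling score is a UCB-style sum of a normalized mean (exploitation) and a weighted standard deviation (exploration). The names `uncertainty_priorities`, `sample_indices`, and `exploration_weight` are hypothetical.

```python
import numpy as np


def uncertainty_priorities(q_values, exploration_weight=1.0):
    """Turn ensemble Q-estimates into sampling probabilities.

    q_values: array of shape (n_heads, n_transitions), one row per
    ensemble member (an assumption; other uncertainty estimators work too).
    """
    q = np.asarray(q_values, dtype=float)
    mean = q.mean(axis=0)  # exploitation signal per transition
    std = q.std(axis=0)    # uncertainty (exploration) signal per transition

    # Rescale the mean to [0, 1] so it is comparable across tasks.
    span = mean.max() - mean.min()
    exploit = (mean - mean.min()) / span if span > 0 else np.zeros_like(mean)

    # UCB-style score (assumed form): value plus weighted uncertainty.
    # The small constant keeps every transition sampleable.
    score = exploit + exploration_weight * std + 1e-8
    return score / score.sum()


def sample_indices(q_values, batch_size, exploration_weight=1.0, rng=None):
    """Draw buffer indices in proportion to the uncertainty-aware score."""
    rng = np.random.default_rng() if rng is None else rng
    probs = uncertainty_priorities(q_values, exploration_weight)
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)
```

With a larger `exploration_weight`, transitions whose Q-estimates disagree across the ensemble are replayed more often; with a weight of zero, sampling depends only on the estimated value.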
