LG, AI, ML | Jul 16, 2024

Satisficing Exploration for Deep Reinforcement Learning

Dilip Arumugam, Saurabh Kumar, Ramki Gummadi, Benjamin Van Roy

Stanford

arXiv:2407.12185v1 | 6.43 citations | h-index: 14

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of data-intensive exploration in deep reinforcement learning for agents in vast, real-world-scale environments. It offers a more practical target than optimality, though it builds incrementally on prior information-theoretic work.

The paper argues that exploring for an optimal policy can be intractable in complex reinforcement-learning environments. It proposes an agent that instead learns satisficing behaviors and does so without model-based planning. Where optimal behavior is feasible, the agent also reaches it more efficiently than its non-information-theoretic counterpart.

A default assumption in the design of reinforcement-learning algorithms is that a decision-making agent always explores to learn optimal behavior. In sufficiently complex environments that approach the vastness and scale of the real world, however, attaining optimal performance may in fact be an entirely intractable endeavor and an agent may seldom find itself in a position to complete the requisite exploration for identifying an optimal policy. Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions, obtained through lossy compression. Notably, such agents may employ fundamentally different exploratory decisions to learn satisficing behaviors more efficiently than optimal ones that are more data intensive. While supported by a rigorous corroborating theory, the underlying algorithm relies on model-based planning, drastically limiting the compatibility of these ideas with function approximation and high-dimensional observations. In this work, we remedy this issue by extending an agent that directly represents uncertainty over the optimal value function allowing it to both bypass the need for model-based planning and to learn satisficing policies. We provide simple yet illustrative experiments that demonstrate how our algorithm enables deep reinforcement-learning agents to achieve satisficing behaviors. In keeping with previous work on this setting for multi-armed bandits, we additionally find that our algorithm is capable of synthesizing optimal behaviors, when feasible, more efficiently than its non-information-theoretic counterpart.
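To make the idea of satisficing exploration concrete, here is a minimal sketch in the multi-armed bandit setting that the abstract mentions. It is not the paper's algorithm, which relies on lossy compression and uncertainty over the optimal value function. It swaps in a much simpler rule in the spirit of satisficing Thompson sampling: sample a mean for each arm from its posterior, then play the lowest-index arm whose sample is within a tolerance of the best sample. The names `satisficing_arm`, `run_bandit`, and `epsilon`, the Bernoulli rewards, and the Beta(1, 1) priors are all assumptions made for illustration.

```python
import random


def satisficing_arm(sampled_means, epsilon):
    """Return the lowest-index arm whose sampled mean is within
    epsilon of the best sampled mean (hypothetical helper)."""
    best = max(sampled_means)
    for arm, mean in enumerate(sampled_means):
        if mean >= best - epsilon:
            return arm


def run_bandit(true_means, epsilon, steps, seed=0):
    """Run the satisficing rule on a Bernoulli bandit and return
    how many times each arm was pulled."""
    rng = random.Random(seed)
    n = len(true_means)
    successes = [1] * n  # Beta(1, 1) prior for every arm (assumption)
    failures = [1] * n
    pulls = [0] * n
    for _ in range(steps):
        # Posterior sampling step, as in Thompson sampling.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n)]
        # Satisficing step: accept any arm that is "good enough".
        arm = satisficing_arm(samples, epsilon)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # epsilon = 0 recovers ordinary Thompson sampling (always chase the best sample).
    print(run_bandit([0.3, 0.5, 0.7], epsilon=0.0, steps=500))
    # A larger epsilon accepts near-best arms and may settle sooner.
    print(run_bandit([0.3, 0.5, 0.7], epsilon=0.25, steps=500))
```

Setting `epsilon` to zero reduces the rule to ordinary Thompson sampling, while larger values let the agent stop distinguishing between arms that are already close enough, which is the trade-off between optimality and exploration cost that the abstract describes.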
