AI · LG · Nov 7, 2023

Hypothesis Network Planned Exploration for Rapid Meta-Reinforcement Learning Adaptation

Maxwell Joseph Jacobson, Rohan Menon, John Zeng, Yexiang Xue

arXiv:2311.03701v2 · 5.42 citations · h-index: 3

Originality: Highly original

AI Analysis

This work addresses slow adaptation in Meta-RL when informative transitions are sparse, and offers a drop-in improvement for model-based Meta-RL algorithms.

The paper tackles the challenge of rapidly identifying similar tasks in Meta-Reinforcement Learning (Meta-RL) by introducing Hypothesis-Planned Exploration (HyPE), which actively plans actions to speed up adaptation. On a natural-language Alchemy game, HyPE identifies the closest task in 65-75% of trials, compared with 18-28% for passive exploration, and yields up to 4x more successful adaptations under the same sample budget.
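The gap between passive and planned task identification can be illustrated with a toy sketch. Everything here is an assumption for illustration, not HyPE itself: tasks are explicit lookup tables rather than learned models in a joint latent space, the helper names (`make_tasks`, `planned_identify`, etc.) are hypothetical, and the planner is a generic greedy hypothesis-splitting heuristic swapped in for HyPE's latent-space planner. Only a few "informative" actions reveal which task is active, mimicking sparse informative transitions.

```python
import random

def make_tasks(num_tasks, num_actions, informative_actions):
    """Build toy tasks as lists mapping action index -> observed outcome.

    Task t reveals bit j of its index only on informative_actions[j];
    every other action yields outcome 0 for all tasks, so informative
    transitions are sparse.
    """
    tasks = []
    for t in range(num_tasks):
        outcomes = [0] * num_actions
        for j, a in enumerate(informative_actions):
            outcomes[a] = (t >> j) & 1
        tasks.append(outcomes)
    return tasks

def consistent(hypotheses, tasks, action, observed):
    """Keep only hypotheses whose model predicts the observed outcome."""
    return [h for h in hypotheses if tasks[h][action] == observed]

def worst_case_remaining(hypotheses, tasks, action):
    """Largest number of hypotheses that could survive after taking `action`."""
    counts = {}
    for h in hypotheses:
        outcome = tasks[h][action]
        counts[outcome] = counts.get(outcome, 0) + 1
    return max(counts.values())

def passive_identify(tasks, true_task, budget, rng):
    """Passive exploration: act uniformly at random, then filter hypotheses."""
    hypotheses = list(range(len(tasks)))
    num_actions = len(tasks[0])
    for _ in range(budget):
        action = rng.randrange(num_actions)
        hypotheses = consistent(hypotheses, tasks, action, tasks[true_task][action])
    return hypotheses == [true_task]

def planned_identify(tasks, true_task, budget):
    """Planned exploration: greedily pick the action that best splits the
    remaining hypotheses (minimizes the worst-case survivor count)."""
    hypotheses = list(range(len(tasks)))
    num_actions = len(tasks[0])
    for _ in range(budget):
        if len(hypotheses) == 1:
            break
        action = min(range(num_actions),
                     key=lambda a: worst_case_remaining(hypotheses, tasks, a))
        hypotheses = consistent(hypotheses, tasks, action, tasks[true_task][action])
    return hypotheses == [true_task]

if __name__ == "__main__":
    num_tasks, num_actions, budget, trials = 8, 50, 6, 1000
    informative = [7, 23, 41]  # 3 bits are enough to tell 8 tasks apart
    tasks = make_tasks(num_tasks, num_actions, informative)
    rng = random.Random(0)
    passive = sum(passive_identify(tasks, rng.randrange(num_tasks), budget, rng)
                  for _ in range(trials))
    planned = sum(planned_identify(tasks, t % num_tasks, budget)
                  for t in range(trials))
    print(f"passive success rate: {passive / trials:.3f}")
    print(f"planned success rate: {planned / trials:.3f}")
```

In this toy setup the greedy planner always isolates the true task within 3 steps, while random exploration succeeds only if it happens to try all three informative actions out of 50 within 6 steps, which occurs with probability below 0.1%.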

Meta-Reinforcement Learning (Meta-RL) learns optimal policies across a series of related tasks. A central challenge in Meta-RL is rapidly identifying which previously learned task is most similar to a new one, so that the agent can adapt to it quickly. Prior approaches, despite significant success, typically rely on passive exploration strategies, such as periods of random action, to characterize the new task in relation to the learned ones. While sufficient when tasks are clearly distinguishable, passive exploration limits adaptation speed when informative transitions are rare or revealed only by specific behaviors. We introduce Hypothesis-Planned Exploration (HyPE), a method that actively plans sequences of actions during adaptation to efficiently identify the most similar previously learned task. HyPE operates within a joint latent space, where state-action transitions from different tasks form distinct paths. This latent-space planning approach enables HyPE to serve as a drop-in improvement for most model-based Meta-RL algorithms. By using planned exploration, HyPE achieves an exponentially lower failure probability than passive strategies when informative transitions are sparse. On a natural-language Alchemy game, HyPE identified the closest task in 65-75% of trials, far outperforming the 18-28% passive exploration baseline, and yielded up to 4x more successful adaptations under the same sample budget.
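Why planning can make the failure probability shrink exponentially faster can be seen in a deliberately simplified model, which is an assumption made here and not the paper's own analysis: suppose each random action produces an informative transition independently with probability p, while each planned action does so with probability q > p. The probability of observing no informative transition in n steps is then:

```latex
% Toy independence model (assumed for illustration only):
% p = per-step chance a random action yields an informative transition
% q = per-step chance a planned action does, with q > p
P_{\text{fail}}^{\text{passive}}(n) = (1-p)^{n}, \qquad
P_{\text{fail}}^{\text{planned}}(n) = (1-q)^{n}, \qquad
\frac{P_{\text{fail}}^{\text{planned}}(n)}{P_{\text{fail}}^{\text{passive}}(n)}
  = \left(\frac{1-q}{1-p}\right)^{n}
```

Since (1-q)/(1-p) < 1 whenever q > p, the ratio decays exponentially in n. For example, with p = 0.02, q = 0.5 and n = 10, the passive miss probability is about 0.817 while the planned one is about 0.00098.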
