LG, AI, RO | Aug 27, 2024

No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster

arXiv:2408.15099v3 | 22.429 citations | h-index: 19 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses inefficient curriculum discovery in reinforcement learning and represents an incremental improvement over existing UED methods.

The paper investigates how Unsupervised Environment Design (UED) methods select training environments in reinforcement learning, finding that current regret approximations fail to predict learnability and instead prioritize scenarios the agent has already mastered. The authors develop a method that trains directly on high-learnability scenarios and outperforms existing UED methods in binary-outcome environments such as Minigrid and a novel robotics-inspired setting.

What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, despite methods aiming to maximise regret in theory, the practical approximations do not correlate with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, offering little to no contribution toward enhancing its abilities. Put differently, current methods fail to predict intuitive measures of "learnability." Specifically, they are unable to consistently identify those scenarios that the agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard domain of Minigrid and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.
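The abstract describes learnable scenarios as those the agent can sometimes solve, but not always. A minimal Python sketch of that idea for binary-outcome environments follows. It assumes each level has an empirical success rate p estimated from rollouts and scores it as p * (1 - p), which is zero for levels that are always or never solved and largest at p = 0.5. The function names, the example success rates, and the simplified worst-case average standing in for a CVaR-style evaluation are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def learnability(success_rate: float) -> float:
    """Assumed score p * (1 - p): zero at p = 0 or p = 1, largest at p = 0.5."""
    return success_rate * (1.0 - success_rate)


def top_learnable_levels(success_rates: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k levels with the highest learnability score."""
    ranked = sorted(
        range(len(success_rates)),
        key=lambda i: learnability(success_rates[i]),
        reverse=True,
    )
    return ranked[:k]


def worst_case_mean(success_rates: Sequence[float], alpha: float) -> float:
    """CVaR-style summary: mean success rate over the worst alpha-fraction of levels.

    Expects a non-empty sequence of success rates.
    """
    n_worst = max(1, int(alpha * len(success_rates)))
    worst = sorted(success_rates)[:n_worst]
    return sum(worst) / len(worst)


# Hypothetical empirical success rates on five candidate levels.
rates = [0.0, 0.5, 1.0, 0.9, 0.25]

# Levels 1 and 4 are solved sometimes but not always, so they rank highest.
selected = top_learnable_levels(rates, k=2)
```

Under this scoring, a level the agent has mastered (p = 1.0) and a level it never solves (p = 0.0) both score zero, so neither would be prioritised for training, which matches the intuition stated in the abstract.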

