CoinRun: Solving Goal Misgeneralisation
The work addresses the challenge of keeping an AI's goals aligned with human intentions, particularly in novel situations, though it appears incremental because it applies an existing method to a specific benchmark.
The paper tackles the problem of goal misgeneralisation in AI alignment by applying the ACE agent to the CoinRun challenge, which it solves without using any new reward information in the new environment.
Goal misgeneralisation is a key challenge in AI alignment -- the task of getting powerful artificial intelligences to align their goals with human intentions and human morality. In this paper, we show how the ACE (Algorithm for Concept Extrapolation) agent can solve one of the key standard challenges in goal misgeneralisation: the CoinRun challenge. The agent uses no new reward information in the new environment. This points to how autonomous agents could be trusted to act in human interests, even in novel and critical situations.
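To make the failure mode concrete, the following is a minimal toy sketch of goal misgeneralisation. It is not the CoinRun environment and not the ACE algorithm. It assumes a one-dimensional corridor in which the coin always sits at the right-hand end during training and elsewhere at test time. All names (`Level`, `reward_reach_coin`, `reward_reach_right_end`, `always_go_right`, `go_to_coin`) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    length: int  # number of cells, indexed 0 .. length - 1
    coin: int    # index of the cell that holds the coin


# Two candidate goals that could explain the training reward.
def reward_reach_coin(level: Level, final_pos: int) -> float:
    return 1.0 if final_pos == level.coin else 0.0


def reward_reach_right_end(level: Level, final_pos: int) -> float:
    return 1.0 if final_pos == level.length - 1 else 0.0


# Two toy policies, each returning the cell where the agent ends up.
def always_go_right(level: Level) -> int:
    return level.length - 1


def go_to_coin(level: Level) -> int:
    return level.coin


def mean_reward(policy, reward_fn, levels) -> float:
    return sum(reward_fn(lv, policy(lv)) for lv in levels) / len(levels)


def hypotheses_agree(levels) -> bool:
    # True if both candidate goals assign the same reward to every
    # possible final position in every level.
    return all(
        reward_reach_coin(lv, pos) == reward_reach_right_end(lv, pos)
        for lv in levels
        for pos in range(lv.length)
    )


# Training: the coin is always in the last cell (assumed toy setup).
train_levels = [Level(length=n, coin=n - 1) for n in (5, 8, 12)]
# Test: the coin has moved away from the last cell.
test_levels = [Level(length=8, coin=3), Level(length=12, coin=0)]

if __name__ == "__main__":
    print("goals agree on train:", hypotheses_agree(train_levels))
    print("goals agree on test: ", hypotheses_agree(test_levels))
    for policy in (always_go_right, go_to_coin):
        print(
            policy.__name__,
            "coin reward on train:",
            mean_reward(policy, reward_reach_coin, train_levels),
            "on test:",
            mean_reward(policy, reward_reach_coin, test_levels),
        )
```

In this sketch the two candidate goals are indistinguishable on the training levels, so a policy that simply walks right earns full reward there yet collects no coins once the coin moves. Choosing between such candidate goals without new reward information is the kind of problem the abstract describes; how ACE does this is not specified here.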