AI, LG · Sep 14, 2021

Benchmarking the Spectrum of Agent Capabilities

arXiv:2109.06780v2 · 204 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the need to benchmark AI agents efficiently across many tasks within a single environment. The contribution is incremental in that it builds on existing simulation-based evaluation methods.

The authors address the problem of evaluating general agent abilities by introducing Crafter, a single open-world survival game that assesses a wide range of capabilities. They also provide baseline scores indicating that the environment's difficulty is appropriate to drive future research.

Evaluating the general abilities of intelligent agents requires complex simulation environments. Existing benchmarks typically evaluate only one narrow task per environment, requiring researchers to perform expensive training runs on many different environments. We introduce Crafter, an open world survival game with visual inputs that evaluates a wide range of general abilities within a single environment. Agents either learn from the provided reward signal or through intrinsic objectives and are evaluated by semantically meaningful achievements that can be unlocked during each episode, such as discovering resources and crafting tools. Consistently unlocking all achievements requires strong generalization, deep exploration, and long-term reasoning. We experimentally verify that Crafter is of appropriate difficulty to drive future research and provide baseline scores of reward agents and unsupervised agents. Furthermore, we observe sophisticated behaviors emerging from maximizing the reward signal, such as building tunnel systems, bridges, houses, and plantations. We hope that Crafter will accelerate research progress by quickly evaluating a wide spectrum of abilities.
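
The achievement-based evaluation described above can be illustrated with a minimal sketch. It assumes each episode is summarized as the set of achievements unlocked at least once, and it reports a per-achievement success rate in percent. The achievement names, the episode data, and the helper functions `success_rates` and `aggregate_score` are hypothetical. The aggregate shown (a geometric-mean-style average taken in log(1 + s) space) is one plausible way to combine the rates so that rarely unlocked achievements still matter; the abstract itself does not specify it.

```python
import math
from typing import Dict, Iterable, List, Set


def success_rates(episodes: List[Set[str]], achievements: Iterable[str]) -> Dict[str, float]:
    """Percentage of episodes in which each achievement was unlocked at least once."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    return {
        name: 100.0 * sum(name in ep for ep in episodes) / n
        for name in achievements
    }


def aggregate_score(rates: Dict[str, float]) -> float:
    """Assumed aggregate: geometric-mean-style average of rates (in percent),
    computed in log(1 + s) space so that a rate of zero stays well-defined."""
    logs = [math.log(1.0 + s) for s in rates.values()]
    return math.exp(sum(logs) / len(logs)) - 1.0


# Hypothetical achievement names and episode outcomes, for illustration only.
achievements = ["collect_wood", "place_table", "make_wood_pickaxe"]
episodes = [
    {"collect_wood", "place_table"},
    {"collect_wood"},
    {"collect_wood", "place_table", "make_wood_pickaxe"},
    set(),  # an episode in which nothing was unlocked
]

rates = success_rates(episodes, achievements)
score = aggregate_score(rates)
print(rates)
print(round(score, 2))
```

Averaging in log space, rather than taking a plain arithmetic mean of the rates, means that an agent gains more from unlocking a hard achievement occasionally than from pushing an already easy one slightly higher.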

Code Implementations: 2 repos
