Evaluating Agents without Rewards
This work is significant for researchers developing intrinsically motivated agents, as it suggests that intrinsic objectives can be better indicators of human-like behavior than task rewards, potentially accelerating the design of more human-aligned AI.
This paper tackles the problem of evaluating agents without explicit reward functions by retrospectively computing intrinsic objectives on pre-collected datasets and comparing them. The authors find that input entropy, information gain, and empowerment each correlate more strongly with a human behavior similarity metric than with task reward across seven agents, three Atari games, and Minecraft.
Reinforcement learning has enabled agents to solve challenging tasks in unknown environments. However, manually crafting reward functions can be time-consuming, expensive, and prone to human error. Competing objectives have been proposed for agents to learn without external supervision, but it has been unclear how well they reflect task rewards or human behavior. To accelerate the development of intrinsic objectives, we retrospectively compute potential objectives on pre-collected datasets of agent behavior, rather than optimizing them online, and compare them by analyzing their correlations. We study input entropy, information gain, and empowerment across seven agents, three Atari games, and the 3D game Minecraft. We find that all three intrinsic objectives correlate more strongly with a human behavior similarity metric than with task reward. Moreover, input entropy and information gain correlate more strongly with human similarity than task reward does, suggesting the use of intrinsic objectives for designing agents that behave similarly to human players.
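The retrospective comparison described above can be illustrated with a minimal sketch. It assumes each agent's dataset has already been reduced to a sequence of discretized, hashable observations, computes an empirical input entropy per agent, and correlates those values with a per-agent score. The agent names, observation sequences, and reward values are hypothetical toy data, and the plain Shannon entropy and Pearson correlation used here are stand-ins, not necessarily the estimators used in the paper.

```python
import math
from collections import Counter


def input_entropy(observations):
    """Empirical Shannon entropy (in nats) of a sequence of discretized observations."""
    counts = Counter(observations)
    total = len(observations)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data (not from the paper): one pre-collected sequence of
# discretized observations per agent, plus a made-up task reward per agent.
datasets = {
    "agent_a": [0, 0, 0, 0],  # always sees the same input
    "agent_b": [0, 1, 0, 1],  # alternates between two inputs
    "agent_c": [0, 1, 2, 3],  # sees four distinct inputs
}
task_reward = {"agent_a": 1.0, "agent_b": 2.0, "agent_c": 3.0}

# Compute the objective offline for every agent, then correlate across agents.
agents = sorted(datasets)
entropies = [input_entropy(datasets[a]) for a in agents]
rewards = [task_reward[a] for a in agents]
correlation = pearson(entropies, rewards)
```

The same pattern would apply to any other per-agent quantity, such as a human similarity score: compute each quantity once from the stored data, then compare the resulting columns by correlation, with no online optimization involved.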