AI | LG | Jun 4, 2022

Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL

arXiv:2206.02039v21 citations | h-index: 46
Originality: Incremental advance
AI Analysis

The paper addresses the evaluation and generalization of RL agents, particularly in planning-based settings. The contribution is incremental because it adapts an existing testing method from another domain (NLP).

The paper tackles the limited evidence that standard evaluation provides for post-deployment generalization in RL by extending the CheckList testing methodology from NLP to planning-based RL. The approach lets users identify previously unknown flaws in an agent's reasoning during tree search, as shown in a user study in which AI researchers evaluated an agent trained to play a complex real-time strategy game.

Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show the approach is effective in allowing users to identify previously-unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
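The query-rule idea in the abstract can be illustrated with a minimal sketch. This is not the paper's interface: the `Node` class, the `check` function, and the `enemy_units` feature are all hypothetical names chosen for illustration. A query selects parent-child pairs in a search tree, and a rule states an invariance that the selected value estimates are expected to satisfy.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class Node:
    """One node of a search tree (hypothetical structure, not the paper's)."""
    state: dict   # state predicted by the learned transition model
    value: float  # value-function estimate for that state
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield every node in the tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


Pair = Tuple[Node, Node]


def check(root: Node,
          query: Callable[[Node, Node], bool],
          rule: Callable[[Node, Node], bool]) -> List[Pair]:
    """Return the (parent, child) pairs selected by `query` that violate `rule`."""
    return [(p, c)
            for p in walk(root)
            for c in p.children
            if query(p, c) and not rule(p, c)]


# Toy tree: the second child has fewer enemy units but a lower value estimate.
root = Node({"enemy_units": 3}, 0.20, [
    Node({"enemy_units": 2}, 0.35),
    Node({"enemy_units": 1}, 0.10),
])

# Query: transitions in which the predicted number of enemy units drops.
def fewer_enemies(parent: Node, child: Node) -> bool:
    return child.state["enemy_units"] < parent.state["enemy_units"]

# Rule (assumed invariance): the value estimate should not decrease.
def value_not_lower(parent: Node, child: Node) -> bool:
    return child.value >= parent.value

violations = check(root, fewer_enemies, value_not_lower)
for parent, child in violations:
    print(f"flagged: {parent.state} -> {child.state}, "
          f"value {parent.value} -> {child.value}")
```

In this toy tree the rule flags only the transition to the state with one enemy unit, since its value estimate falls although the situation has nominally improved. A real test suite would draw queries and rules from domain knowledge about the game, and a flagged pair is a candidate flaw for a human to inspect rather than a proven bug.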

