LG · AI · May 26, 2022

Evaluating Multimodal Interactive Agents

arXiv:2205.13274v2 · 7 citations · h-index: 69
Originality: Incremental advance
AI Analysis

This paper addresses the slow and expensive evaluation faced by AI researchers developing human-like interactive agents, and offers an incremental improvement over existing metrics.

The paper tackles the challenge of evaluating multimodal interactive agents by introducing the Standardised Test Suite (STS). The STS uses behavioural scenarios drawn from real human interactions and ranks agents by their success rate in offline continuations of those scenarios, yielding a fast, controlled, and interpretable evaluation method.

Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. A video may be found at https://youtu.be/YR1TngGORGQ.
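The ranking step described above (agents ordered by the proportion of continuations that annotators mark as successful) can be illustrated with a minimal sketch. This is not the authors' code: the function name `rank_agents`, the `(agent, scenario_id, success)` record layout, and the sample data are all assumptions made for illustration only.

```python
from collections import defaultdict


def rank_agents(annotations):
    """Rank agents by the fraction of continuations marked as success.

    `annotations` is an iterable of (agent_name, scenario_id, success)
    tuples, one per annotated continuation. This layout is a hypothetical
    choice for the sketch, not a format taken from the paper.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for agent, _scenario_id, success in annotations:
        totals[agent] += 1
        successes[agent] += int(success)

    # Success rate = successful continuations / all annotated continuations.
    rates = {agent: successes[agent] / totals[agent] for agent in totals}

    # Highest success rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up annotations for two hypothetical agents on three scenarios.
annotations = [
    ("agent_a", "s1", True),
    ("agent_a", "s2", False),
    ("agent_a", "s3", True),
    ("agent_b", "s1", True),
    ("agent_b", "s2", True),
    ("agent_b", "s3", True),
]

ranking = rank_agents(annotations)
for agent, rate in ranking:
    print(f"{agent}: {rate:.2f}")
```

Because every agent is scored on the same fixed set of scenarios, the resulting rates are directly comparable across agents, which is the sense in which the abstract calls the evaluation "controlled".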