AI, Aug 29, 2014

AI Evaluation: past, present and future

arXiv:1408.6908v3, 24 citations
AI Analysis

This work addresses the challenge of evaluating increasingly complex AI systems, a concern for both researchers and practitioners. Its contribution is incremental, since it builds on existing evaluation paradigms.

The paper examines the evolution of AI evaluation methods, highlighting a shift from traditional task-oriented approaches to more complex behavioral and ability-oriented frameworks, and proposes ideas for more systematic and robust evaluation without presenting specific numerical results.

Artificial intelligence develops techniques and systems whose performance must be evaluated on a regular basis in order to certify and foster progress in the discipline. We will describe and critically assess the different ways AI systems are evaluated. We first focus on the traditional task-oriented evaluation approach. We see that black-box (behavioural) evaluation is becoming more and more common, as AI systems are becoming more complex and unpredictable. We identify three kinds of evaluation: human discrimination, problem benchmarks and peer confrontation. We describe the limitations of the many evaluation settings and competitions in these three categories and propose several ideas for a more systematic and robust evaluation. We then focus on a less customary (and more challenging) ability-oriented evaluation approach, where a system is characterised by its (cognitive) abilities, rather than by the tasks it is designed to solve. We discuss several possibilities: the adaptation of cognitive tests used for humans and animals, the development of tests derived from algorithmic information theory, or more general approaches under the perspective of universal psychometrics.

