LG HC MA Jun 2, 2021

Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger

arXiv:2106.01254v210.614 citations Has Code

Originality Highly original

AI Analysis

The paper addresses the challenge of assessing AI systems in domains such as subjective decision-making, where ground truth is inaccessible, and offers a practical evaluation method.

The paper tackles the problem of evaluating classifiers when definitive ground truth is unavailable. It introduces a framework based on human judgments and quantifies performance through rater equivalence: the smallest number of human raters needed to match the classifier's performance.

In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
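
To make the definition of rater equivalence concrete, here is a minimal Python sketch. It is not the authors' released code, and it makes simplifying assumptions the abstract does not specify: labels are binary, a panel's combined judgment is a majority vote (ties earn half credit), and performance is the agreement rate with a held-out reference rater, which loosely corresponds to the "matching individual human judgments" notion of utility. The function names `classifier_score`, `panel_score`, and `rater_equivalence` are hypothetical.

```python
from itertools import combinations


def classifier_score(labels, preds):
    """Fraction of (item, rater) pairs where the classifier matches the rater."""
    hits = sum(pred == lab for row, pred in zip(labels, preds) for lab in row)
    total = sum(len(row) for row in labels)
    return hits / total


def panel_score(labels, k):
    """Agreement of a k-rater majority vote with a held-out reference rater.

    Averages over every item, every choice of reference rater, and every
    k-subset of the remaining raters. Assumes binary labels (0/1) and
    k smaller than the number of raters per item.
    """
    credit, total = 0.0, 0
    for row in labels:
        for r, ref in enumerate(row):
            others = row[:r] + row[r + 1:]  # hold out the reference rater
            for panel in combinations(others, k):
                ones = sum(panel)
                if 2 * ones == k:
                    credit += 0.5  # tie: half credit (an assumption)
                else:
                    credit += float((2 * ones > k) == (ref == 1))
                total += 1
    return credit / total


def rater_equivalence(labels, preds, max_k=None):
    """Smallest panel size whose score reaches the classifier's score."""
    if max_k is None:
        max_k = min(len(row) for row in labels) - 1
    target = classifier_score(labels, preds)
    for k in range(1, max_k + 1):
        if panel_score(labels, k) >= target:
            return k
    return None  # no panel of up to max_k raters matches the classifier


# Toy data (made up for illustration): 4 items, 4 raters each.
labels = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
preds = [1, 1, 0, 0]

print(classifier_score(labels, preds))   # 0.875
print(rater_equivalence(labels, preds))  # 3
```

On this toy data, panels of one or two raters agree with the reference rater 75% of the time, while a three-rater panel reaches the classifier's 87.5%, so the sketch reports a rater equivalence of 3. A real analysis would need a principled combiner and scoring rule, plus enough raters per item to estimate these averages reliably.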
