LG, HC, MA · Jun 2, 2021

Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

arXiv:2106.01254v2 · 14 citations
AI Analysis

This paper addresses the challenge of assessing AI systems in domains such as subjective decision-making, where ground truth is inaccessible, and offers a practical evaluation method.

The paper tackles the problem of evaluating classifiers when definitive ground truth is unavailable by introducing a framework based on human judgments, quantifying performance through rater equivalence—the smallest number of human raters needed to match the classifier's performance.

In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
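The definition of rater equivalence can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes binary labels, combines a panel by majority vote (ties broken toward 1), scores both panels and the classifier by agreement with a held-out reference rater, and reads "matches" as "scores at least as high as". The function names `panel_score`, `classifier_score`, and `rater_equivalence` are hypothetical, chosen for this illustration.

```python
from itertools import combinations
from statistics import mean


def majority(votes):
    """Majority vote over binary labels; ties break toward 1 (an assumption)."""
    return 1 if 2 * sum(votes) >= len(votes) else 0


def panel_score(labels, k):
    """Mean agreement between a k-rater majority vote and a held-out rater.

    labels: one list of binary ratings per item, one rating per rater.
    Every rater takes a turn as the reference, and every k-subset of the
    remaining raters on that item is used as a panel.
    """
    scores = []
    for item in labels:
        for ref_idx, ref in enumerate(item):
            others = item[:ref_idx] + item[ref_idx + 1:]
            for panel in combinations(others, k):
                scores.append(1.0 if majority(panel) == ref else 0.0)
    return mean(scores)


def classifier_score(labels, predictions):
    """Mean agreement between the classifier and each individual rater."""
    return mean(
        1.0 if pred == ref else 0.0
        for item, pred in zip(labels, predictions)
        for ref in item
    )


def rater_equivalence(labels, predictions, max_k):
    """Smallest panel size k whose score reaches the classifier's score.

    Returns None if no panel of size up to max_k does (assumed convention).
    max_k must be smaller than the number of raters per item, because one
    rater is always held out as the reference.
    """
    target = classifier_score(labels, predictions)
    for k in range(1, max_k + 1):
        if panel_score(labels, k) >= target:
            return k
    return None


# Toy data: 4 items, each rated by 4 raters.
labels = [[1, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 1], [0, 0, 0, 0]]
predictions = [1, 0, 1, 0]
print(rater_equivalence(labels, predictions, max_k=3))
```

On this toy data the classifier agrees with individual raters as often as a three-rater majority panel does, so the sketch reports a rater equivalence of 3. Scoring against a held-out rater corresponds to the second utility model in the abstract (matching individual human judgments). The ground-truth-based model would need a different scoring function.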
