CL, May 28, 2016

Building an Evaluation Scale using Item Response Theory

arXiv:1605.08889v2
33 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for more nuanced evaluation in NLP by offering a method that compares systems against human performance. The advance is incremental, since it applies an existing psychometric theory to a new domain.

The authors propose Item Response Theory (IRT) as an alternative to standard metrics for evaluating NLP systems. They demonstrate that IRT gives more insight into system performance by accounting for each item's difficulty and discriminating power, and they show that high accuracy does not always imply a high IRT score.

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items (their difficulty and discriminating power) and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
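
To make the last claim concrete, here is a minimal sketch of how two systems with identical accuracy can receive different ability estimates. It assumes a two-parameter logistic (2PL) IRT model, which is a standard textbook form and not necessarily the model fitted in the paper. The item parameters in `ITEMS`, the two response patterns, and the grid-search estimator are all hypothetical, chosen only for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve (assumed model form).

    theta: ability, a: discriminating power, b: difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), y in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability via a simple grid search (illustrative only)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination a, difficulty b) pairs.
# The first two discriminate weakly, the last two strongly.
ITEMS = [(0.5, -1.0), (0.5, 0.0), (2.0, 0.0), (2.0, 1.0)]

# Both hypothetical systems answer 2 of 4 items correctly.
system_a = [1, 1, 0, 0]  # correct only on the weakly discriminating items
system_b = [0, 0, 1, 1]  # correct only on the strongly discriminating items

accuracy_a = sum(system_a) / len(system_a)
accuracy_b = sum(system_b) / len(system_b)
theta_a = estimate_ability(ITEMS, system_a)
theta_b = estimate_ability(ITEMS, system_b)

print(f"System A: accuracy={accuracy_a:.2f}, ability={theta_a:.2f}")
print(f"System B: accuracy={accuracy_b:.2f}, ability={theta_b:.2f}")
```

Under these assumptions both systems score 50% accuracy, yet System B receives the higher ability estimate because its correct answers fall on the harder, more discriminating items. A real analysis would first fit the item parameters to the collected human responses rather than set them by hand.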

