CL · May 29, 2023

A Critical Evaluation of Evaluations for Long-form Question Answering

arXiv:2305.18201v1 · 276 citations
Originality: Incremental advance
AI Analysis

This paper addresses a critical evaluation bottleneck for NLP researchers and practitioners working on long-form question answering. Its contribution is incremental: it provides analysis rather than new methods.

The paper studies how long-form question answering systems are evaluated, conducting the first targeted study that covers both human and automatic evaluation. It finds that no existing automatic metric predicts human preference judgments, although some metrics correlate with specific answer aspects such as coherence.

Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts' evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single "overall score" of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.
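
To make the phrase "predictive of human preference judgments" concrete, the sketch below computes how often a metric's ranking of an answer pair agrees with the expert's preferred answer. It is a minimal illustration under stated assumptions, not the authors' released code: the `PreferencePair` schema, the `length_metric` toy scorer, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One expert judgment over two candidate answers (hypothetical schema)."""
    question: str
    answer_a: str
    answer_b: str
    preferred: str  # "a" or "b"


def pairwise_agreement(
    pairs: Sequence[PreferencePair],
    metric: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the metric scores the expert-preferred answer higher.

    Metric ties count as half agreement, so a constant metric scores 0.5.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    total = 0.0
    for pair in pairs:
        score_a = metric(pair.question, pair.answer_a)
        score_b = metric(pair.question, pair.answer_b)
        if score_a == score_b:
            total += 0.5
        elif (score_a > score_b) == (pair.preferred == "a"):
            total += 1.0
    return total / len(pairs)


def length_metric(question: str, answer: str) -> float:
    """Toy stand-in for an automatic metric: longer answers score higher."""
    return float(len(answer.split()))


# Made-up example data, for illustration only.
toy_pairs = [
    PreferencePair(
        "Why is the sky blue?",
        "Rayleigh scattering favors short wavelengths.",
        "It just is.",
        "a",
    ),
    PreferencePair(
        "What causes tides?",
        "The moon.",
        "Mostly the moon's gravity, plus the sun's.",
        "b",
    ),
    PreferencePair(
        "How do vaccines work?",
        "They train the immune system.",
        "They are a kind of medicine that people take sometimes.",
        "a",
    ),
]

agreement = pairwise_agreement(toy_pairs, length_metric)
```

An agreement near 0.5 would mean the metric does no better than a coin flip on these pairs. Any real metric can be swapped in for `length_metric` as long as it maps a question and an answer to a single score.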

Code Implementations: 1 repo
