CL | AI | Feb 17, 2024

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

arXiv:2402.11161v5 | 43 citations | h-index: 8 | EMNLP
Originality: Incremental advance
AI Analysis

This paper addresses the difficulty of measuring answer correctness for QA systems, particularly LLMs, by providing a cheaper and more interpretable alternative to expensive LLM-based scorers.

The paper tackles the problem of evaluating verbose, free-form answers from large language models in question answering by introducing rubrics and datasets from the Trivia community, and proposes an efficient and interpretable evaluation method that is more stable than exact match and BERTScore.

Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
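To make the abstract's point about verbose answers concrete, here is a minimal sketch of the exact match baseline it mentions, alongside a token-overlap F1. This is not the PEDANTS method. The normalization rules (lowercasing, stripping punctuation and English articles) follow common short-form QA practice and are an assumption here, and the function names `normalize`, `exact_match`, and `token_f1` are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    cand_tokens = normalize(candidate).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Paris"
short_answer = "Paris."
verbose_answer = "The capital of France is Paris."

print(exact_match(short_answer, reference))    # True
print(exact_match(verbose_answer, reference))  # False, although the answer is correct
print(round(token_f1(verbose_answer, reference), 3))
```

The verbose answer is correct, yet exact match rejects it and token F1 penalizes it for its extra words. That is the kind of failure on free-form LLM output that the abstract describes.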

Code Implementations: 1 repo
