CL, Feb 28, 2022

'Tis but Thy Name: Semantic Question Answering Evaluation with 11M Names for 1M Entities

arXiv:2202.13581v1

Originality: Incremental advance

AI Analysis

The paper addresses the need for more accurate semantic evaluation metrics in QA, though the advance is incremental because it builds on existing neural metric approaches.

The paper tackles the problem of evaluating question answering systems by creating Wiki Entity Similarity (WES), an 11M-example dataset of semantic entity similarities built from Wikipedia link texts. It reports that a basic cross-encoder metric predicts human judgments of correctness better than four classic metrics.

Classic lexical-matching-based QA metrics are slowly being phased out because they punish succinct or informative outputs merely because those answers were not provided as ground truth. Recently proposed neural metrics can evaluate semantic similarity but were trained on small textual similarity datasets grafted from foreign domains. We introduce the Wiki Entity Similarity (WES) dataset, an 11M-example, domain-targeted semantic entity similarity dataset generated from link texts in Wikipedia. WES is tailored to QA evaluation: the examples are entities and phrases, grouped into semantic clusters to simulate multiple ground-truth labels. Human annotators consistently agree with WES labels, and a basic cross-encoder metric is better than four classic metrics at predicting human judgments of correctness.
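To make the abstract's complaint about lexical metrics concrete, here is a minimal sketch, assuming a common formulation of exact match and token-level F1 (the paper's four classic metrics are not named here, so these two are stand-ins). The function names and the example cluster are hypothetical and are not drawn from WES. The sketch shows a correct alias scoring zero against a single reference, and the same alias being accepted once the reference is a cluster of names for one entity.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction, reference):
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Token-overlap F1 between a prediction and one reference string."""
    pred, ref = normalize(prediction), normalize(reference)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_over_cluster(metric, prediction, cluster):
    """Score against every name in a non-empty cluster and keep the best."""
    return max(metric(prediction, name) for name in cluster)


# Hypothetical cluster of names for one entity (illustrative, not WES data).
cluster = ["New York City", "NYC", "The Big Apple"]
single_reference = cluster[0]
prediction = "NYC"

# Against a single reference, the correct alias shares no tokens with it.
em_single = exact_match(prediction, single_reference)
f1_single = token_f1(prediction, single_reference)

# Against the whole cluster, the alias matches one of the accepted names.
em_cluster = best_over_cluster(exact_match, prediction, cluster)

print(em_single, f1_single, em_cluster)
```

Taking the maximum over a cluster only helps when the alias is already listed. A learned similarity metric, such as the cross encoder the abstract mentions, is meant to cover names that no reference list contains.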
