CL
May 1, 2020

KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

arXiv:2005.00192v3730 citations
Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of automatically evaluating generative question answering. Its contribution is incremental: it improves on existing metrics by focusing on the key meaning of an answer.

The authors propose KPQA, a metric for evaluating generative question answering systems that uses keyphrase weights to assess answer correctness. On two datasets, it correlates significantly more strongly with human judgments than existing metrics do.

In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers because the answers are free-form. In particular, widely used n-gram similarity metrics often fail to identify incorrect answers because they weight all tokens equally. To alleviate this problem, we propose KPQA-metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics. The code is available at https://github.com/hwanheelee1993/KPQA.
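To make the idea of token weighting concrete, here is a minimal sketch of a weighted unigram F1. It is an illustration under stated assumptions, not the authors' implementation: the function names `matched_weight` and `weighted_f1` are hypothetical, and the per-token weights are written by hand, whereas the abstract says they come from keyphrase prediction.

```python
from collections import Counter


def matched_weight(tokens, weights, other_tokens):
    """Sum the weights of tokens that also occur in other_tokens.

    Matching is clipped: each occurrence in other_tokens can be matched once.
    """
    remaining = Counter(other_tokens)
    matched = 0.0
    for tok, w in zip(tokens, weights):
        if remaining[tok] > 0:
            remaining[tok] -= 1
            matched += w
    return matched


def weighted_f1(candidate, reference, cand_weights, ref_weights):
    """Weighted unigram F1 between a candidate and a reference token list.

    The weights are hypothetical importance scores, one per token position.
    With all weights equal to 1.0 this reduces to plain unigram F1.
    """
    if len(candidate) != len(cand_weights) or len(reference) != len(ref_weights):
        raise ValueError("each token needs exactly one weight")
    cand_total = sum(cand_weights)
    ref_total = sum(ref_weights)
    if cand_total == 0 or ref_total == 0:
        return 0.0
    precision = matched_weight(candidate, cand_weights, reference) / cand_total
    recall = matched_weight(reference, ref_weights, candidate) / ref_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hand-picked weights (assumed, not predicted by a model).
reference = ["the", "capital", "is", "paris"]
ref_w = [0.1, 0.2, 0.1, 0.9]

wrong = ["the", "capital", "is", "london"]   # high overlap, wrong answer
wrong_w = [0.1, 0.2, 0.1, 0.9]
right = ["paris"]                            # low overlap, correct answer
right_w = [0.9]

uniform = lambda toks: [1.0] * len(toks)

print("uniform  wrong:", weighted_f1(wrong, reference, uniform(wrong), uniform(reference)))
print("uniform  right:", weighted_f1(right, reference, uniform(right), uniform(reference)))
print("weighted wrong:", weighted_f1(wrong, reference, wrong_w, ref_w))
print("weighted right:", weighted_f1(right, reference, right_w, ref_w))
```

In this toy case the uniform score favors the wrong answer because it shares three of four tokens with the reference, while the weighted score favors the short correct answer because the key token carries most of the weight.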

Code Implementations: 1 repo
