CL · May 1, 2020

KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Joongbo Shin, Kyomin Jung

arXiv:2005.00192v3 · 27.97 · 30 citations · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of automatic evaluation for generative question answering. The advance is incremental: it improves upon existing metrics by weighting the tokens that carry an answer's key meaning.

The authors tackled the problem of evaluating generative question answering systems by proposing KPQA, a metric that uses keyphrase weights to assess answer correctness, showing it has significantly higher correlation with human judgments than existing metrics on two datasets.

In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form of the answer. Especially, widely used n-gram similarity metrics often fail to discriminate the incorrect answers since they equally consider all of the tokens. To alleviate this problem, we propose KPQA-metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns different weights to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics. The code is available at https://github.com/hwanheelee1993/KPQA.
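
The idea of weighting tokens by keyphrase importance can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a simple weighted unigram F1, and the per-token weights below are hand-picked, hypothetical values standing in for the output of a keyphrase predictor. The function names `weighted_overlap` and `weighted_f1` are likewise illustrative.

```python
from collections import Counter


def weighted_overlap(tokens, weights, other_tokens):
    """Fraction of total token weight in `tokens` that is matched in `other_tokens`.

    Matches are clipped by count, so a token repeated in `tokens` is only
    credited as many times as it occurs in `other_tokens`.
    """
    remaining = Counter(other_tokens)
    matched = 0.0
    for tok, w in zip(tokens, weights):
        if remaining[tok] > 0:
            remaining[tok] -= 1
            matched += w
    total = sum(weights)
    return matched / total if total > 0 else 0.0


def weighted_f1(cand, cand_w, ref, ref_w):
    """Weighted unigram F1 between a candidate answer and a reference answer."""
    precision = weighted_overlap(cand, cand_w, ref)
    recall = weighted_overlap(ref, ref_w, cand)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reference answer and two candidate answers.
ref = ["the", "capital", "is", "paris"]
wrong = ["the", "capital", "is", "london"]  # high n-gram overlap, wrong answer
right = ["paris"]                           # low n-gram overlap, correct answer

# Assumed keyphrase weights (made up for illustration): the answer entity
# gets a high weight, function words get low weights.
ref_w = [0.1, 0.2, 0.1, 0.9]
wrong_w = [0.1, 0.2, 0.1, 0.9]
right_w = [0.9]

# Uniform weights reproduce an ordinary unigram F1.
uniform_wrong = weighted_f1(wrong, [1.0] * len(wrong), ref, [1.0] * len(ref))
uniform_right = weighted_f1(right, [1.0] * len(right), ref, [1.0] * len(ref))

# Keyphrase weights shift the score toward the tokens that matter.
kp_wrong = weighted_f1(wrong, wrong_w, ref, ref_w)
kp_right = weighted_f1(right, right_w, ref, ref_w)

print(f"uniform:   wrong={uniform_wrong:.3f}  right={uniform_right:.3f}")
print(f"keyphrase: wrong={kp_wrong:.3f}  right={kp_right:.3f}")
```

With uniform weights the incorrect candidate scores higher than the correct one, because it shares three of four tokens with the reference. With the assumed keyphrase weights the ranking flips, which is the failure mode of equal-weight n-gram metrics that the abstract describes.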
