CL · Jun 10, 2018

Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task

arXiv:1806.03578v11095 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses evaluation issues for machine reading comprehension systems, but it is incremental as it adapts existing metrics for specific question types.

The paper tackled the problem of bias in ROUGE and BLEU metrics when evaluating machine reading comprehension systems for yes-no and entity list question types, and showed through statistical analysis that their adaptations improved correlation with human judgment.

Current evaluation metrics for question-answering-based machine reading comprehension (MRC) systems, such as ROUGE and BLEU, generally focus on the lexical overlap between the candidate and reference answers. However, these metrics can be biased for specific question types, especially questions that ask for yes-no opinions or entity lists. In this paper, we adapt the metrics so that n-gram overlap correlates better with human judgment for answers to these two question types. Statistical analysis proves the effectiveness of our approach. Our adaptations may provide positive guidance for the development of real-scene MRC systems.
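To make the yes-no bias concrete, here is a minimal Python sketch of how a purely lexical metric can rank an answer with the opposite opinion above one that agrees with the reference. It is not the paper's adaptation. It assumes a simplified ROUGE-L computed as the plain F1 of the longest common subsequence over whitespace tokens, and the function names and example sentences are hypothetical, chosen only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall.

    Assumption: whitespace tokenization and equal weighting of
    precision and recall, which is a simplification of the usual metric.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical yes-no example: the second candidate contradicts the reference.
reference = "yes it is safe to drink"
agreeing = "yes it is"
opposing = "no it is not safe to drink"

score_agreeing = rouge_l_f1(agreeing, reference)
score_opposing = rouge_l_f1(opposing, reference)

print(f"agreeing: {score_agreeing:.3f}")
print(f"opposing: {score_opposing:.3f}")
```

In this toy case the contradicting answer shares more tokens with the reference than the short agreeing answer does, so it receives the higher overlap score even though a human judge would mark it wrong.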
