CL
Apr 29, 2022

QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance

MILA
arXiv:2204.13921v1295 citations
h-index: 44
Originality: Incremental advance
AI Analysis

The work addresses the need for more accurate evaluation of question generation, particularly for questions that involve complex reasoning or draw on multiple evidence sources. The advance is incremental, since it builds on existing language models.

The paper tackles the problem of evaluating generated questions by proposing QRelScore, a context-aware metric that better correlates with human judgments and is more robust to adversarial samples than existing methods.
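Correlation with human judgments is typically checked by scoring a set of generated questions with the metric and comparing those scores against human ratings of the same questions. The sketch below shows one common way to do that, using Pearson and Spearman correlation in plain Python. The ratings and metric scores are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Illustrative data only: human ratings for five generated questions,
# and the scores two hypothetical metrics assign to the same questions.
human_ratings = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # orders the questions as humans do
metric_b = [0.5, 0.1, 0.4, 0.2, 0.3]  # largely disagrees with humans

rho_a = spearman(human_ratings, metric_a)
rho_b = spearman(human_ratings, metric_b)
```

A metric whose ranking of questions agrees with the human ranking gets a Spearman coefficient near 1, which is the sense in which one metric "better correlates with human judgments" than another.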

Existing metrics for assessing question generation not only require costly human references but also ignore the input context of generation, and therefore lack a deep understanding of the relevance between the generated questions and their input contexts. As a result, they may wrongly penalize a legitimate and reasonable candidate question when it (i) involves complicated reasoning over the context or (ii) can be grounded in multiple pieces of evidence in the context. In this paper, we propose $\textbf{QRelScore}$, a context-aware $\underline{\textbf{Rel}}$evance evaluation metric for $\underline{\textbf{Q}}$uestion Generation. Based on off-the-shelf language models such as BERT and GPT2, QRelScore employs both word-level hierarchical matching and sentence-level prompt-based generation to cope with complicated reasoning and with diverse generation from multiple pieces of evidence, respectively. Compared with existing metrics, our experiments demonstrate that QRelScore achieves a higher correlation with human judgments while being much more robust to adversarial samples.
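To make the idea of a reference-free, context-aware score concrete, here is a minimal sketch of greedy soft token matching: each question token is matched to its most similar context token, and the similarities are averaged. This is not the QRelScore algorithm, whose hierarchical matching and prompt-based generation are not specified in the abstract. The function names and the tiny hand-written vectors in `TOY_EMBEDDINGS` are assumptions for illustration; a real system would presumably use contextual embeddings from a model such as BERT.

```python
from math import sqrt

# Hypothetical toy embeddings, hand-written for illustration only.
TOY_EMBEDDINGS = {
    "capital": [1.0, 0.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "paris": [0.6, 0.8, 0.0],
    "banana": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_relevance(question_tokens, context_tokens, embeddings):
    """Average, over question tokens, of the best match in the context.

    The score is computed against the input context itself, so no
    human-written reference question is needed.
    """
    question_vecs = [embeddings[t] for t in question_tokens if t in embeddings]
    context_vecs = [embeddings[t] for t in context_tokens if t in embeddings]
    if not question_vecs or not context_vecs:
        return 0.0
    best_matches = [
        max(cosine(q, c) for c in context_vecs) for q in question_vecs
    ]
    return sum(best_matches) / len(best_matches)


context = ["paris", "capital", "france"]
on_topic = context_relevance(["capital", "france"], context, TOY_EMBEDDINGS)
off_topic = context_relevance(["banana"], context, TOY_EMBEDDINGS)
```

With these toy vectors the on-topic question scores higher than the off-topic one. Word-level matching of this kind cannot by itself capture multi-step reasoning over the context, which is the gap the abstract says the sentence-level, generation-based component is meant to address.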

