CL · May 15, 2018

Harvesting Paragraph-Level Question-Answer Pairs from Wikipedia

arXiv:1805.05942v11170 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for large-scale, high-quality QA datasets for natural language processing tasks. The advance is incremental because it builds on existing sentence-level question generation approaches.

The paper tackles the problem of generating paragraph-level question-answer pairs from Wikipedia by proposing a neural network with a coreference gating mechanism. The resulting models outperform state-of-the-art methods, and the full system is used to create a corpus of over one million pairs from the 10,000 top-ranking Wikipedia articles.

We study the task of generating from Wikipedia articles question-answer pairs that cover content beyond a single sentence. We propose a neural network approach that incorporates coreference knowledge via a novel gating mechanism. Compared to models that only take into account sentence-level information (Heilman and Smith, 2010; Du et al., 2017; Zhou et al., 2017), we find that the linguistic knowledge introduced by the coreference representation aids question generation significantly, producing models that outperform the current state-of-the-art. We apply our system (composed of an answer span extraction system and the passage-level QG system) to the 10,000 top-ranking Wikipedia articles and create a corpus of over one million question-answer pairs. We also provide a qualitative analysis for this large-scale generated corpus from Wikipedia.
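
The abstract names a gating mechanism over coreference knowledge but does not give its form. As a rough illustration only, the sketch below shows a generic element-wise gate over a concatenated word vector and coreference vector. The sigmoid gate, the function name `gate_features`, and all dimensions and values are assumptions made for this example, not the paper's formulation.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_features(word_vec, coref_vec, weights, bias):
    """Scale concatenated word and coreference features by a learned gate.

    Hypothetical helper: the gate for each dimension is
    sigmoid(weights[i] . features + bias[i]), applied element-wise.
    """
    features = word_vec + coref_vec  # list concatenation
    gated = []
    for row, b, f in zip(weights, bias, features):
        pre_activation = sum(w * x for w, x in zip(row, features)) + b
        gated.append(sigmoid(pre_activation) * f)
    return gated


# Toy inputs (made up): a 2-d word vector and a 2-d coreference vector.
word_vec = [0.5, -1.0]
coref_vec = [1.0, 0.0]
dim = len(word_vec) + len(coref_vec)

# Zero weights and biases give a gate of exactly 0.5 on every dimension.
weights = [[0.0] * dim for _ in range(dim)]
bias = [0.0] * dim

print(gate_features(word_vec, coref_vec, weights, bias))
# prints [0.25, -0.5, 0.5, 0.0]
```

In a trained model the weights and biases would be learned, so the gate could pass coreference features through for some tokens and suppress them for others.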

Code Implementations: 1 repo
