Know What You Don't Know: Unanswerable Questions for SQuAD
This paper gives researchers and practitioners a more robust benchmark for natural language understanding systems. The contribution is incremental in that it builds on the existing SQuAD dataset.
The paper tackles the problem of extractive reading comprehension systems making unreliable guesses on unanswerable questions. It introduces SQuAD 2.0, a dataset that combines existing SQuAD data with over 50,000 adversarially written unanswerable questions. On this dataset, a strong neural system that reaches 86% F1 on SQuAD 1.1 scores only 66% F1.
Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.
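The requirement to either answer or abstain can be made concrete with a small sketch. The code below is illustrative only and is not the paper's model or the official evaluation script. It assumes that an empty string stands for "no answer", that a system exposes a score for its best span and a separate no-answer score, and that abstention is decided by a tunable threshold. The names `token_f1`, `answer_or_abstain`, `span_score`, `no_answer_score`, and `threshold` are hypothetical. The F1 here uses plain lowercased whitespace tokens, which is a simplification.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 where an empty string means 'no answer' (assumed convention)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is 'no answer', the score is 1.0 only when both abstain.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_or_abstain(span: str, span_score: float,
                      no_answer_score: float, threshold: float = 0.0) -> str:
    """Return the span, or '' when the no-answer score beats it by more than threshold.

    Hypothetical decision rule for illustration; the scores are assumed to come
    from some reading comprehension model.
    """
    if no_answer_score - span_score > threshold:
        return ""
    return span


# Unanswerable question: guessing a plausible span earns nothing, abstaining earns full credit.
print(token_f1("Paris", ""))
print(token_f1(answer_or_abstain("Paris", 2.0, 5.0), ""))
```

Under this scoring, a system that always returns its best span gets zero credit on every unanswerable question, which is why a model tuned only on answerable questions loses ground once such questions are mixed in.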