CL, AI
Jul 27, 2021

QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

arXiv:2107.12708v2
195 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of dataset proliferation for practitioners and researchers in NLP, offering a structured overview to guide resource selection and future development, though it is incremental as a survey.

The study provides a comprehensive survey of over 80 new question answering and reading comprehension datasets from the past two years, analyzing their formats, domains, and gaps, and proposes a new taxonomy for classification.

Alongside the huge volume of research on deep learning models in NLP in recent years, there has also been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of "skills" that question answering/reading comprehension systems are supposed to acquire, and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed both at practitioners looking for pointers to the wealth of existing data and at researchers working on new resources.
