CL · May 7, 2021

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

arXiv:2105.03011v1 · 819 citations
Originality: Synthesis-oriented
AI Analysis

This dataset addresses the problem of building efficient question answering tools for NLP practitioners reading academic papers, though the contribution is incremental because it focuses on a specific domain.

The authors address the lack of datasets for complex question answering over research papers by introducing QASPER, a dataset of 5,049 questions over 1,585 NLP papers, on which existing models underperform humans by at least 27 F1 points.

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.
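
The abstract reports the human-model gap in F1 points without defining the metric. A minimal sketch of one common choice in QA evaluation, token-overlap F1 between a predicted and a reference answer, is shown below. That QASPER scores answers this way is an assumption here, as are the function name `token_f1` and the plain lowercase whitespace tokenization; the example answers are made up for illustration.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings, on a 0-1 scale.

    Assumption: lowercase whitespace tokenization. Real evaluation
    scripts usually also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answers, not taken from the dataset.
    print(token_f1("BERT base model", "BERT model"))
```

Under this reading, a gap of 27 F1 points corresponds to a difference of 0.27 on the 0-1 scale returned above, averaged over questions.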
