Releasing the CRaQAn (Coreference Resolution in Question-Answering): An open-source dataset and dataset creation methodology using instruction-following models
This work removes a critical barrier for researchers and practitioners in natural language processing by enabling experimentation with coreference resolution in QA applications. The contribution is incremental, however, because it focuses on dataset creation rather than on a new model or algorithm.
The paper tackles the lack of open-source datasets for coreference resolution in question-answering tasks by introducing the CRaQAn dataset, which provides over 250 question-answer pairs with coreferences, developed using a novel methodology involving GPT-4 and a Recursive Criticism and Improvement Loop.
Instruction-following language models demand robust information retrieval methodologies to augment instructions for question-answering applications. A primary challenge is resolving coreferences under the chunking strategies used for long documents. The critical barrier to experimenting with coreference handling is the lack of open-source datasets, specifically for question-answering tasks that require coreference resolution. In this work we present our Coreference Resolution in Question-Answering (CRaQAn) dataset, an open-source dataset that caters to the nuanced information retrieval requirements of coreference resolution in question-answering tasks by providing over 250 question-answer pairs containing coreferences. To build this dataset, we developed a novel approach for creating high-quality datasets using an instruction-following model (GPT-4) and a Recursive Criticism and Improvement Loop.
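The abstract does not spell out the prompts, reviewers, or stopping rule of the Recursive Criticism and Improvement Loop, so the following is only a minimal sketch of the general generate, critique, regenerate pattern. Every name in it (`QAPair`, `rci_loop`, `toy_generate`, `toy_reviewer`, `max_rounds`) is hypothetical, and the toy functions stand in for calls to an instruction-following model such as GPT-4.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QAPair:
    question: str
    answer: str


# Hypothetical interfaces, not taken from the paper:
#   a generator receives the passage plus feedback from the previous round,
#   a reviewer returns a list of problems (an empty list means "accepted").
Generator = Callable[[str, List[str]], QAPair]
Reviewer = Callable[[str, QAPair], List[str]]


def rci_loop(
    passage: str,
    generate: Generator,
    reviewers: List[Reviewer],
    max_rounds: int = 3,
) -> Optional[QAPair]:
    """Generate a QA pair, collect criticism, and retry until no reviewer objects."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(passage, feedback)
        # Gather criticism from every reviewer for this candidate.
        feedback = [
            issue for review in reviewers for issue in review(passage, candidate)
        ]
        if not feedback:
            return candidate
    # No candidate passed review within the round budget.
    return None


def toy_generate(passage: str, feedback: List[str]) -> QAPair:
    """Stand-in for a model call: the first draft is flawed, the revision is not."""
    if not feedback:
        return QAPair("Where was she born?", "Warsaw")
    return QAPair("Where was Marie Curie born?", "Warsaw")


def toy_reviewer(passage: str, qa: QAPair) -> List[str]:
    """Stand-in for a model-based critic: flag pronouns left in the question."""
    pronouns = {"he", "she", "it", "they"}
    words = {w.strip("?.,").lower() for w in qa.question.split()}
    if words & pronouns:
        return ["Question contains an unresolved pronoun."]
    return []


passage = "Marie Curie was a physicist. She was born in Warsaw."
result = rci_loop(passage, toy_generate, [toy_reviewer])
print(result)
```

In this toy run the first draft is rejected and the revised pair is accepted on the second round. The pronoun check is an illustrative criterion only; the review criteria actually used to build CRaQAn are not stated in the abstract.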