The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts
The corpus gives researchers in natural language processing and literary analysis a resource for evaluating models of quotation attribution and coreference in English literary texts. The contribution is incremental, since its focus is dataset creation.
The authors address quotation attribution in literary texts by creating the Project Dialogism Novel Corpus (PDNC), which annotates 35,978 quotations across 22 novels and is the largest such dataset available.
We present the Project Dialogism Novel Corpus, or PDNC, an annotated dataset of quotations for English literary texts. PDNC contains annotations for 35,978 quotations across 22 full-length novels, and is by an order of magnitude the largest corpus of its kind. Each quotation is annotated for the speaker, addressees, type of quotation, referring expression, and character mentions within the quotation text. The annotated attributes allow for a comprehensive evaluation of models of quotation attribution and coreference for literary texts.
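To make the annotated attributes concrete, the following is a minimal sketch of how one quotation record and a simple speaker-attribution score might be represented. It is not the corpus's actual file format or the authors' evaluation code: the class name `QuotationAnnotation`, all field names, the function `attribution_accuracy`, and the toy characters and quotations are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuotationAnnotation:
    """One annotated quotation. Field names are hypothetical, not the corpus schema."""
    quote_text: str
    speaker: str
    addressees: List[str] = field(default_factory=list)
    quote_type: str = ""            # type of quotation; label set left unspecified here
    referring_expression: str = ""  # e.g. the phrase that introduces the speaker
    mentions: List[str] = field(default_factory=list)  # characters mentioned inside the quote


def attribution_accuracy(gold: List[QuotationAnnotation],
                         predicted_speakers: List[str]) -> float:
    """Fraction of quotations whose predicted speaker matches the annotated speaker."""
    if len(gold) != len(predicted_speakers):
        raise ValueError("gold and predicted_speakers must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted_speakers) if g.speaker == p)
    return correct / len(gold)


# Invented toy records, not taken from the corpus.
gold = [
    QuotationAnnotation(
        quote_text="Have you seen Bob today?",
        speaker="Alice",
        addressees=["Carol"],
        referring_expression="asked Alice",
        mentions=["Bob"],
    ),
    QuotationAnnotation(
        quote_text="Not since this morning.",
        speaker="Carol",
        addressees=["Alice"],
    ),
]

# A hypothetical model that attributes every quotation to "Alice".
score = attribution_accuracy(gold, ["Alice", "Alice"])
print(score)  # 0.5
```

Exact-match accuracy on the speaker field is only one possible measure; the addressee and mention fields could be scored in a similar way under the same assumptions.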