CL, Apr 12, 2022

The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts

U of Toronto
arXiv:2204.05836v1, 591 citations, h-index: 53
Originality: Synthesis-oriented
AI Analysis

PDNC gives researchers in natural language processing and literary analysis a resource for evaluating quotation attribution and coreference models on English literary texts. The contribution is incremental in that it centers on dataset creation rather than new methods.

The authors tackled the problem of quotation attribution in literary texts by creating the Project Dialogism Novel Corpus (PDNC), which includes annotations for 35,978 quotations across 22 novels, making it the largest such dataset available.

We present the Project Dialogism Novel Corpus, or PDNC, an annotated dataset of quotations for English literary texts. PDNC contains annotations for 35,978 quotations across 22 full-length novels, and is by an order of magnitude the largest corpus of its kind. Each quotation is annotated for the speaker, addressees, type of quotation, referring expression, and character mentions within the quotation text. The annotated attributes allow for a comprehensive evaluation of models of quotation attribution and coreference for literary texts.
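The annotated attributes listed above can be pictured as one record per quotation. The following is a minimal sketch, not the corpus's actual schema: the class name, field names, the `"explicit"` label value, and the `attribution_accuracy` helper are all assumptions introduced for illustration, and the sample data is a toy example rather than text from PDNC.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class QuotationAnnotation:
    """One annotated quotation (hypothetical field names, not PDNC's file format)."""
    quote_text: str
    speaker: str
    addressees: List[str] = field(default_factory=list)
    quote_type: str = "explicit"      # assumed label value, for illustration only
    referring_expression: str = ""    # e.g. the phrase that introduces the speaker
    mentions: List[str] = field(default_factory=list)  # characters named inside the quote


def attribution_accuracy(
    gold: Sequence[QuotationAnnotation],
    predicted_speakers: Sequence[str],
) -> float:
    """Fraction of quotations whose predicted speaker matches the gold speaker."""
    if len(gold) != len(predicted_speakers):
        raise ValueError("gold and predictions must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for q, p in zip(gold, predicted_speakers) if q.speaker == p)
    return correct / len(gold)


if __name__ == "__main__":
    # Toy records, not taken from the corpus.
    gold = [
        QuotationAnnotation("Good morning.", speaker="Alice", addressees=["Bob"]),
        QuotationAnnotation("Is it?", speaker="Bob", addressees=["Alice"]),
    ]
    print(attribution_accuracy(gold, ["Alice", "Alice"]))  # prints 0.5
```

Keeping the speaker, addressees, and in-quote mentions on the same record is what would let a single dataset score both attribution and coreference models, as the abstract describes.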

Code Implementations: 2 repos
