CL | Oct 17, 2025

The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works

arXiv:2510.15594v1 | 10.94 citations | h-index: 2 | Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

Originality: Synthesis-oriented

AI Analysis

This work addresses coreference resolution in long, complex literary works, with computational literature researchers and NLP practitioners as its audience. The contribution is incremental: it builds on existing methods and pairs them with new data.

The paper tackles the scarcity of annotated long documents for coreference resolution by introducing a new corpus of three full-length French novels totaling over 285,000 tokens. It also presents a modular pipeline that is competitive and scales effectively to long texts, and it shows that the pipeline can be used to infer the gender of fictional characters, which is relevant to both literary analysis and downstream NLP tasks.

While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
