CL · Feb 19

Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference

arXiv:2602.17424v1 · 1 citation · h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses the need for more diverse and balanced cross-document coreference resolution datasets for analyzing polarized news media. The advance is incremental: it builds on existing datasets with a new annotation approach.

The paper tackles the limited lexical diversity of cross-document coreference resolution datasets. It proposes a revised annotation scheme, applied to NewsWCL50 and a subset of ECB+, that accommodates both identity and near-identity relations in order to capture wording variation in news coverage. The resulting reannotated datasets align closely with each other and support discourse-aware research.

Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme for the NewsWCL50 dataset that treats coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan", "asylum seekers", and "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse while maintaining the fine-grained annotation of DEs. We reannotate NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
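
To make the two evaluation ideas named in the abstract concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The toy mentions, the function names, and the choice of "unique head lemmas per chain" as a lexical diversity measure are all hypothetical, and the paper's actual metrics may differ. The baseline simply places mentions that share a head lemma in the same cluster.

```python
from collections import defaultdict

# Hypothetical toy data: (mention_id, head_lemma, gold_chain_id).
# Head lemmas are assumed to be precomputed by some parser/lemmatizer.
MENTIONS = [
    ("m1", "caravan", "DE1"),
    ("m2", "seeker", "DE1"),
    ("m3", "caravan", "DE1"),
    ("m4", "president", "DE2"),
    ("m5", "president", "DE2"),
    ("m6", "leader", "DE2"),
]


def same_head_lemma_clusters(mentions):
    """Baseline: mentions with an identical head lemma form one cluster."""
    clusters = defaultdict(list)
    for mention_id, head_lemma, _gold_chain in mentions:
        clusters[head_lemma.lower()].append(mention_id)
    return dict(clusters)


def unique_lemmas_per_chain(mentions):
    """Illustrative diversity measure: distinct head lemmas in each gold chain."""
    lemmas = defaultdict(set)
    for _mention_id, head_lemma, gold_chain in mentions:
        lemmas[gold_chain].add(head_lemma.lower())
    return {chain: len(lemma_set) for chain, lemma_set in lemmas.items()}


predicted = same_head_lemma_clusters(MENTIONS)
diversity = unique_lemmas_per_chain(MENTIONS)
```

In this toy example the baseline splits each gold chain in two, because "caravan" and "seeker" (or "president" and "leader") never share a head lemma. The more distinct lemmas a chain contains, the worse such a baseline is expected to perform, which is why it can serve as a rough probe of lexical diversity.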

