CL AI May 21, 2021

CEREC: A Corpus for Entity Resolution in Email Conversations

arXiv:2105.10606v2991 citations
Originality: Synthesis-oriented
AI Analysis

CEREC gives researchers working on entity resolution in email data a new benchmark. The contribution is incremental, since it focuses on dataset creation rather than novel methods.

The authors address the lack of a large-scale dataset for entity resolution in email conversations by creating CEREC, a corpus of 6,001 email threads with 60,383 entity coreference chains. The best of their baselines reaches 59.2 F1 on mention identification and coreference resolution.

We present the first large-scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6,001 email threads from the Enron Email Corpus, containing 36,448 email messages and 60,383 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments evaluate different features and the performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 59.2 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.

Code Implementations: 1 repo