CL · Mar 31, 2024

A Controlled Reevaluation of Coreference Resolution Models

MILA
arXiv:2404.00727v284 citations · h-index: 6 · LREC
AI Analysis

This work disentangles the factors that confound comparisons between coreference resolution models, showing that much of the reported progress is attributable to the choice of pretrained language model rather than to the task-specific architecture.

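The paper addresses ambiguity about the source of performance gains in coreference resolution models by systematically evaluating five models while controlling for the pretrained language model each one uses. When language model size is controlled, encoder-based models outperform more recent decoder-based ones in both accuracy and inference speed. Among encoder-based models, newer models are not always more accurate, and the oldest model tested generalizes best to out-of-domain genres. Controlling for the language model accounts for most, but not all, of the F1 gains reported over the past five years.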

All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or other factors, such as the task-specific architecture, is difficult or impossible to determine due to lack of a standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models and control for certain design decisions including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in terms of both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes the best to out-of-domain textual genres. We conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported in the past five years.

Code Implementations: 1 repo