Are Large Language Models Robust Coreference Resolvers?
This paper addresses the robustness of coreference resolution for NLP applications, but the contribution is incremental because it builds on existing prompt-based methods.
The paper assesses the feasibility of using prompt-based large language models for coreference resolution on difficult benchmarks such as CoNLL-2012. It shows that these models can outperform current unsupervised systems but rely on high-quality mention detectors, and that fine-tuning remains preferable when even a small amount of annotated data is available.
Recent work on extending coreference resolution across domains and languages relies on annotated data in both the target domain and language. At the same time, pre-trained large language models (LMs) have been reported to exhibit strong zero- and few-shot learning abilities across a wide range of NLP tasks. However, prior work mostly studied this ability using artificial sentence-level datasets such as the Winograd Schema Challenge. In this paper, we assess the feasibility of prompt-based coreference resolution by evaluating instruction-tuned language models on difficult, linguistically complex coreference benchmarks (e.g., CoNLL-2012). We show that prompting for coreference can outperform current unsupervised coreference systems, although this approach appears to rely on high-quality mention detectors. Further investigations reveal that instruction-tuned LMs generalize surprisingly well across domains, languages, and time periods; yet continued fine-tuning of neural models should still be preferred if small amounts of annotated examples are available.
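
To make the dependence on mention detection concrete, the following is a minimal sketch of how a passage with pre-detected mentions might be turned into a coreference prompt. It is an illustration only, not the prompt template used in the paper: the function names `mark_mentions` and `build_prompt`, the bracket-and-index markup, and the instruction wording are all assumptions, and mentions are assumed to be non-overlapping token spans supplied by an external detector.

```python
from typing import List, Tuple

def mark_mentions(tokens: List[str], mentions: List[Tuple[int, int]]) -> str:
    """Wrap each candidate mention in brackets and tag it with a 1-based index.

    `mentions` holds (start, end) token offsets, end inclusive. Spans are
    assumed not to overlap or nest (a simplification for this sketch).
    """
    starts = {start for start, _ in mentions}
    ends = {end: idx for idx, (_, end) in enumerate(mentions, start=1)}
    marked = []
    for pos, tok in enumerate(tokens):
        if pos in starts:
            tok = "[" + tok
        if pos in ends:
            tok = f"{tok}]#{ends[pos]}"
        marked.append(tok)
    return " ".join(marked)

def build_prompt(tokens: List[str], mentions: List[Tuple[int, int]]) -> str:
    """Assemble a hypothetical instruction prompt around the marked passage."""
    passage = mark_mentions(tokens, mentions)
    return (
        "Each bracketed span in the passage is a mention with an index. "
        "Group the indices of mentions that refer to the same entity.\n"
        f"Passage: {passage}\n"
        "Clusters:"
    )

tokens = "Alice said she was tired".split()
mentions = [(0, 0), (2, 2)]  # spans from a hypothetical mention detector
marked_passage = mark_mentions(tokens, mentions)  # "[Alice]#1 said [she]#2 was tired"
prompt = build_prompt(tokens, mentions)
```

Because the model is asked only to link the spans it is given, a mention the detector misses can never appear in a cluster, which is one way to read the reliance on high-quality mention detectors noted above.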