CL · Aug 21, 2019

WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Vid Kocijan, Oana-Maria Camburu, Ana-Maria Cretu, Yordan Yordanov, Phil Blunsom, Thomas Lukasiewicz

arXiv:1908.08025v330.21008 citations · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the shortage of labeled data for coreference resolution in NLP. It provides a useful resource and model for researchers and practitioners, though the method itself is incremental.

The paper tackles the scarcity of large-scale training data for pronoun resolution by introducing WikiCREM, a large unsupervised corpus. Combined with a language-model-based approach, it matches or outperforms the previous state of the art on 6 out of 7 coreference resolution datasets.

Pronoun resolution is a major area of natural language understanding. However, large-scale training sets are still scarce, since manually labelling data is costly. In this work, we introduce WikiCREM (Wikipedia CoREferences Masked), a large-scale yet accurate dataset of pronoun disambiguation instances. We use a language-model-based approach for pronoun resolution in combination with our WikiCREM dataset. We compare a series of models on a collection of diverse and challenging coreference resolution problems, where we match or outperform previous state-of-the-art approaches on 6 out of 7 datasets, such as GAP, DPR, WNLI, PDP, WinoBias, and WinoGender. We release our model to be used off-the-shelf for solving pronoun disambiguation.
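To make the idea of a "masked coreference" instance concrete, here is a minimal sketch of how such an example might be built from raw text. It is an illustration under stated assumptions, not the authors' pipeline: the helper `make_instance`, the `MaskedInstance` type, the `[MASK]` token, and the hand-supplied list of names are all hypothetical (a real pipeline would presumably detect names automatically). The sketch masks the second mention of a repeated name and keeps the other names in the text as distractor candidates.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

MASK = "[MASK]"  # placeholder token; an assumption, not taken from the paper


@dataclass
class MaskedInstance:
    text: str              # sentence with one name mention replaced by MASK
    answer: str            # the name that was masked out
    candidates: List[str]  # answer first, then distractor names


def _mentions(name: str, text: str) -> List[tuple]:
    """Return (start, end) spans of whole-word occurrences of name in text."""
    return [m.span() for m in re.finditer(r"\b" + re.escape(name) + r"\b", text)]


def make_instance(text: str, names: List[str]) -> Optional[MaskedInstance]:
    """Mask the second mention of the first repeated name.

    Returns None when no name repeats or when no other name remains
    in the masked text to serve as a distractor.
    """
    for name in names:
        spans = _mentions(name, text)
        if len(spans) < 2:
            continue
        start, end = spans[1]
        masked = text[:start] + MASK + text[end:]
        distractors = [n for n in names if n != name and _mentions(n, masked)]
        if distractors:
            return MaskedInstance(masked, name, [name] + distractors)
    return None


if __name__ == "__main__":
    inst = make_instance("Alice called Bob because Alice needed help.",
                         ["Alice", "Bob"])
    print(inst)
```

A model trained on instances like this learns to pick the candidate that best fills the masked position, which is the same decision it has to make when resolving a pronoun.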
