CL · Oct 11, 2018

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

arXiv:1810.05201v1 · 1188 citations
Originality: Incremental advance
AI Analysis

This paper addresses a coreference resolution bottleneck for NLP practitioners by providing a gender-balanced dataset of ambiguous pronouns. The advance is incremental because it builds on existing corpus efforts.

The authors address the lack of diverse, gender-balanced data for ambiguous pronoun resolution by creating GAP, a corpus of 8,908 ambiguous pronoun-name pairs. The best baseline model reaches only 66.9% F1.

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun-name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines which demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.
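The F1 figure quoted above is the harmonic mean of precision and recall. As a minimal sketch, assuming each pronoun-name pair carries a boolean gold label (does the pronoun refer to the name?) and a boolean system prediction, it could be computed as follows. The `PairPrediction` class, the `f1_score` helper, and the toy data are hypothetical illustrations, not the corpus's official scorer or file format.

```python
from dataclasses import dataclass


@dataclass
class PairPrediction:
    """One pronoun-name pair (hypothetical structure, for illustration only)."""
    gold: bool       # True if the pronoun actually refers to the name
    predicted: bool  # True if the system claims it does


def f1_score(pairs):
    """Return F1 = 2PR / (P + R) over a list of pair predictions."""
    tp = sum(1 for p in pairs if p.gold and p.predicted)
    fp = sum(1 for p in pairs if not p.gold and p.predicted)
    fn = sum(1 for p in pairs if p.gold and not p.predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): one true positive, one miss, one false alarm,
# and one true negative.
toy_pairs = [
    PairPrediction(gold=True, predicted=True),
    PairPrediction(gold=True, predicted=False),
    PairPrediction(gold=False, predicted=True),
    PairPrediction(gold=False, predicted=False),
]
toy_f1 = f1_score(toy_pairs)
```

Because the corpus is gender-balanced, the same function could be applied separately to the masculine and feminine subsets to compare performance across genders, assuming each example records the pronoun's gender.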

Code Implementations: 4 repos
