CL · AI · LG · Aug 31, 2019

Evaluating Pronominal Anaphora in Machine Translation: An Evaluation Measure and a Test Suite

arXiv:1909.00131v11004 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for specialized evaluation in machine translation to better assess discourse-level improvements, particularly for researchers and practitioners focusing on pronoun handling.

The paper tackles the problem that traditional automatic evaluation measures like BLEU fail to detect improvements in pronominal anaphora resolution in machine translation, even when humans perceive substantial gains. It contributes a targeted dataset for testing pronoun translation across multiple source languages and proposes a new evaluation measure that correlates with human judgments.

The ongoing neural revolution in machine translation has made it easier to model larger contexts beyond the sentence-level, which can potentially help resolve some discourse-level ambiguities such as pronominal anaphora, thus enabling better translations. Unfortunately, even when the resulting improvements are seen as substantial by humans, they remain virtually unnoticed by traditional automatic evaluation measures like BLEU, as only a few words end up being affected. Thus, specialized evaluation measures are needed. With this aim in mind, we contribute an extensive, targeted dataset that can be used as a test suite for pronoun translation, covering multiple source languages and different pronoun errors drawn from real system translations, for English. We further propose an evaluation measure to differentiate good and bad pronoun translations. We also conduct a user study to report correlations with human judgments.
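The abstract's claim that BLEU barely registers pronoun fixes because "only a few words end up being affected" can be illustrated with a small sketch. The code below is not the paper's evaluation measure or test suite. It uses a simplified, unsmoothed, single-reference corpus BLEU written from scratch, plus a hypothetical position-matched pronoun check that assumes hypothesis and reference are token-aligned. The example sentences and the `PRONOUNS` set are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_toy(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: one reference per sentence, no smoothing."""
    overlap = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp_text, ref_text in zip(hypotheses, references):
        hyp, ref = hyp_text.split(), ref_text.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            overlap[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if min(overlap) == 0:
        return 0.0
    log_precision = sum(math.log(o / t) for o, t in zip(overlap, total)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

# Hypothetical pronoun list, for illustration only.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def pronoun_accuracy_toy(hypotheses, references):
    """Share of reference pronouns reproduced at the same token position.

    Toy assumption: hypothesis and reference are token-aligned.
    """
    correct = seen = 0
    for hyp_text, ref_text in zip(hypotheses, references):
        hyp, ref = hyp_text.split(), ref_text.split()
        for i, token in enumerate(ref):
            if token in PRONOUNS:
                seen += 1
                correct += int(i < len(hyp) and hyp[i] == token)
    return correct / seen if seen else 1.0

references = [
    "the committee reviewed the proposal last week and it was approved without any further changes",
    "the new bridge across the river will open to traffic early next month",
    "local farmers expect a good harvest after several weeks of steady rain",
    "the museum extended opening hours during the summer to welcome more visitors",
]
good = list(references)
# Same output except for a single wrong pronoun ("he" instead of "it").
bad = [references[0].replace(" it ", " he ")] + references[1:]

bleu_good = corpus_bleu_toy(good, references)
bleu_bad = corpus_bleu_toy(bad, references)
acc_good = pronoun_accuracy_toy(good, references)
acc_bad = pronoun_accuracy_toy(bad, references)

print(f"BLEU: good={bleu_good:.3f}, bad={bleu_bad:.3f}")
print(f"Pronoun accuracy: good={acc_good:.2f}, bad={acc_bad:.2f}")
```

On this four-sentence toy set, the single pronoun error costs only a few BLEU points, while the targeted check drops from fully correct to fully wrong. The gap widens as the corpus grows, since pronouns make up an ever smaller share of the n-grams.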

Code Implementations: 2 repos
