CL · Apr 21

Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

arXiv:2604.1739385.61 citations · h-index: 14
Predicted impact: top 49% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For researchers developing MT evaluation metrics, the paper highlights the importance of comparing metric-human agreement against inter-annotator agreement when evaluating across domains, revealing that current metrics are less robust than previously thought.

The paper investigates the robustness of machine translation evaluation metrics under domain shift. Metrics appear robust at the segment level, but this robustness largely disappears once human label variation is taken into account. On the unseen chemical domain, metrics reach an agreement of only 0.78-0.83, compared with 0.96 for human annotators.

Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator Cross-Domain Error-Span-Annotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78-0.83 vs. 0.96). We recommend comparing metric-human agreement against inter-annotator agreement, rather than comparing raw metric-human agreement alone, when evaluating across different domains.
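The recommendation in the abstract can be illustrated with a minimal sketch. The abstract does not say which agreement statistic the authors use, so this example assumes a simple pairwise ranking agreement: the fraction of segment pairs that two score lists order the same way. The function names (`pairwise_agreement`, `agreement_report`) and the toy scores are hypothetical and do not come from the paper or from CD-ESA.

```python
from itertools import combinations
from statistics import mean


def pairwise_agreement(a, b):
    """Fraction of segment pairs that two score lists rank in the same order.

    Assumed statistic for illustration only; pairs tied in either list are skipped.
    """
    concordant = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        diff_a = a[i] - a[j]
        diff_b = b[i] - b[j]
        if diff_a == 0 or diff_b == 0:
            continue
        total += 1
        if (diff_a > 0) == (diff_b > 0):
            concordant += 1
    return concordant / total if total else float("nan")


def agreement_report(human_scores, metric_scores):
    """Compare metric-human agreement against inter-annotator agreement.

    human_scores: dict mapping annotator name -> per-segment scores
    metric_scores: per-segment scores from an automatic metric
    """
    annotators = list(human_scores)
    # Human ceiling: mean agreement over all annotator pairs.
    inter_annotator = mean(
        pairwise_agreement(human_scores[x], human_scores[y])
        for x, y in combinations(annotators, 2)
    )
    # Metric treated as one more rater, compared against each annotator.
    metric_human = mean(
        pairwise_agreement(metric_scores, human_scores[x]) for x in annotators
    )
    return {
        "inter_annotator": inter_annotator,
        "metric_human": metric_human,
        "gap": inter_annotator - metric_human,
    }


# Toy data (hypothetical): four segments, two annotators, one metric.
toy_humans = {
    "annotator_a": [90, 70, 40, 20],
    "annotator_b": [85, 75, 30, 35],
}
toy_metric = [0.8, 0.9, 0.5, 0.4]

report = agreement_report(toy_humans, toy_metric)
print(report)
```

Reading `metric_human` next to `inter_annotator` for each domain, rather than on its own, shows how far a metric sits below the level at which humans agree with each other in that domain.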

