CL | Nov 1, 2023

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks

arXiv:2311.00508v1134 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a practical problem for machine translation practitioners: automatic evaluation metrics can be unreliable. The contribution is incremental, since it builds on existing adversarial testing methods.

The paper investigates the robustness of machine translation metrics by testing them on adversarially synthesized texts. It finds that BERTScore, BLEURT, and COMET tend to overpenalize adversarially degraded translations relative to human judgments, and that BERTScore ratings can be inconsistent.

We investigate MT evaluation metric performance on adversarially-synthesized texts, to shed light on metric robustness. We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations. We also identify inconsistencies in BERTScore ratings, where it judges the original sentence and the adversarially-degraded one as similar, while judging the degraded translation as notably worse than the original with respect to the reference. We identify patterns of brittleness that motivate more robust metric development.
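The abstract's two main ideas, a character-level attack and the BERTScore inconsistency check, can be illustrated with a minimal sketch. This is not the authors' implementation: the adjacent-character swap is only one simple kind of character-level perturbation, and the names `swap_adjacent_chars` and `inconsistency_report`, along with the thresholds `similar_at` and `drop_at`, are hypothetical choices made for illustration. The `metric` argument stands in for any scorer that maps a (candidate, reference) pair to a number.

```python
import random
from typing import Callable, Dict

# Any scorer taking (candidate, reference) and returning a float.
Metric = Callable[[str, str], float]


def swap_adjacent_chars(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Illustrative character-level attack: swap adjacent, differing letters."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        # Only consider positions where the swap actually changes the string.
        positions = [
            i for i in range(len(chars) - 1)
            if chars[i].isalpha() and chars[i + 1].isalpha()
            and chars[i] != chars[i + 1]
        ]
        if not positions:
            break
        i = rng.choice(positions)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def inconsistency_report(
    metric: Metric,
    original: str,
    degraded: str,
    reference: str,
    similar_at: float = 0.9,  # hypothetical threshold, not from the paper
    drop_at: float = 0.1,     # hypothetical threshold, not from the paper
) -> Dict[str, object]:
    """Flag cases where the metric rates original and degraded as similar,
    yet scores the degraded text notably lower against the reference."""
    pair_similarity = metric(degraded, original)
    reference_drop = metric(original, reference) - metric(degraded, reference)
    return {
        "pair_similarity": pair_similarity,
        "reference_drop": reference_drop,
        "inconsistent": pair_similarity >= similar_at and reference_drop >= drop_at,
    }
```

To try this against a real metric, `metric` would wrap a scorer such as BERTScore, BLEURT, or COMET behind the same two-argument interface. The thresholds would then need tuning to that metric's score range.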

Code Implementations: 1 repo