CL · Nov 1, 2023

Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks

arXiv:2311.00508v121.4134 citations · h-index: 6 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the unreliability of automatic evaluation metrics, a practical problem for machine translation practitioners. The contribution is incremental, since it builds on existing adversarial testing methods.

The paper investigates the robustness of machine translation metrics by testing them on adversarially synthesized texts. It finds that BERTScore, BLEURT, and COMET tend to overpenalize adversarially degraded translations, and that BERTScore in particular gives inconsistent ratings.

We investigate MT evaluation metric performance on adversarially-synthesized texts, to shed light on metric robustness. We experiment with word- and character-level attacks on three popular machine translation metrics: BERTScore, BLEURT, and COMET. Our human experiments validate that automatic metrics tend to overpenalize adversarially-degraded translations. We also identify inconsistencies in BERTScore ratings, where it judges the original sentence and the adversarially-degraded one as similar, while judging the degraded translation as notably worse than the original with respect to the reference. We identify patterns of brittleness that motivate more robust metric development.
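The BERTScore inconsistency described above can be illustrated with a minimal sketch. This is not the paper's code. The function names (`swap_adjacent_chars`, `token_f1`, `is_inconsistent`) and both thresholds are hypothetical, and `token_f1` is only a toy stand-in so the example runs without any model; in practice a learned metric such as BERTScore would be passed in as `metric`.

```python
import random
from collections import Counter
from typing import Callable


def swap_adjacent_chars(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Illustrative character-level perturbation: swap two adjacent interior
    letters in up to `n_swaps` randomly chosen words."""
    rng = random.Random(seed)
    words = text.split()
    # Only words with at least 4 characters have two interior letters to swap.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        w = words[i]
        j = rng.randrange(1, len(w) - 2)  # keep first and last letters fixed
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def token_f1(candidate: str, reference: str) -> float:
    """Toy stand-in metric (token-overlap F1), NOT BERTScore, BLEURT, or COMET."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def is_inconsistent(
    metric: Callable[[str, str], float],
    original: str,
    degraded: str,
    reference: str,
    sim_threshold: float = 0.9,   # assumed cutoff for "judged as similar"
    drop_threshold: float = 0.1,  # assumed cutoff for "notably worse"
) -> bool:
    """Flag the pattern from the abstract: the metric rates the original and
    degraded translations as similar to each other, yet rates the degraded one
    notably lower than the original against the reference."""
    judged_similar = metric(degraded, original) >= sim_threshold
    score_drop = metric(original, reference) - metric(degraded, reference)
    return judged_similar and score_drop >= drop_threshold
```

A typical use would be to perturb a system translation with `swap_adjacent_chars`, then call `is_inconsistent` with a real metric wrapped as a two-argument function returning a float. Suitable thresholds depend on the metric's score range and would need to be chosen empirically.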

