CL · Oct 14, 2024

Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation

Emmanouil Zaranis, Giuseppe Attanasio, Sweta Agrawal, André F. T. Martins

arXiv:2410.10995v47.210 citations · h-index: 20 · Has Code · ACL

Originality: Incremental advance

AI Analysis

This work addresses a critical fairness issue in machine translation that affects both users and developers: biased QE metrics can exacerbate gender disparities. It is a significant but incremental contribution to bias detection in NLP.

The paper investigates gender bias in machine translation quality estimation (QE) metrics, finding that they systematically favor masculine-inflected translations over feminine or gender-neutral ones, and that biased QE metrics lead to more errors for feminine referents and affect downstream tasks like data filtering.

Quality estimation (QE), the automatic assessment of translation quality, has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. While QE metrics have been optimized to align with human judgments, whether they encode social biases has been largely overlooked. Biased QE risks favoring certain demographic groups over others, e.g., by exacerbating gaps in visibility and usability. This paper defines and investigates gender bias of QE metrics and discusses its downstream implications for machine translation (MT). Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. When a human entity's gender in the source is undisclosed, masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Even when contextual cues disambiguate gender, using context-aware QE metrics leads to more errors in selecting the correct translation inflection for feminine referents than for masculine ones. Moreover, a biased QE metric affects data filtering and quality-aware decoding. Our findings underscore the need for a renewed focus on developing and evaluating QE metrics centered on gender.
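
The comparison described in the abstract (scoring a masculine-inflected and a feminine-inflected translation of the same gender-ambiguous source) can be illustrated with a minimal sketch. This is not the authors' code or protocol: `ContrastivePair`, `gender_score_gap`, and `toy_scorer` are hypothetical names, the example sentence is illustrative, and `toy_scorer` is a deliberately biased stand-in rather than a real QE metric. In practice, `score` would wrap whatever reference-free QE model is being audited.

```python
from statistics import mean
from typing import Callable, Iterable, NamedTuple, Tuple


class ContrastivePair(NamedTuple):
    """One gender-ambiguous source with two translations differing only in gender inflection."""
    source: str
    masculine: str
    feminine: str


def gender_score_gap(
    pairs: Iterable[ContrastivePair],
    score: Callable[[str, str], float],
) -> Tuple[float, float, float]:
    """Return (mean masculine score, mean feminine score,
    share of pairs where the masculine translation scores strictly higher)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    masc = [score(p.source, p.masculine) for p in pairs]
    fem = [score(p.source, p.feminine) for p in pairs]
    masc_preferred = sum(m > f for m, f in zip(masc, fem)) / len(pairs)
    return mean(masc), mean(fem), masc_preferred


def toy_scorer(source: str, translation: str) -> float:
    # Hypothetical stand-in, NOT a real QE metric: it is hard-coded to favor
    # hypotheses that start with the Italian masculine article "Il ".
    return 0.9 if translation.startswith("Il ") else 0.8


# Illustrative pair (assumption, not taken from the paper's datasets).
pairs = [
    ContrastivePair(
        source="The doctor arrived.",
        masculine="Il dottore è arrivato.",
        feminine="La dottoressa è arrivata.",
    )
]
print(gender_score_gap(pairs, toy_scorer))
```

Under this framing, an unbiased metric would be expected to give similar mean scores to both variants and a masculine-preferred share near one half when the source leaves gender undisclosed; the paper's own measures and thresholds may differ.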
