Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
This work addresses a critical bias in machine translation evaluation: Quality Estimation metrics can unfairly penalize longer, correct translations, which affects applications such as reranking and reinforcement learning. Its contribution is an incremental improvement to existing metrics.
The study examines systematic length bias in Quality Estimation metrics for machine translation. It shows that these metrics over-predict errors for longer translations and prefer shorter candidates, and it proposes two mitigation strategies, length normalization and reference incorporation, both of which reduced the bias.
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors as translation length increases, even for high-quality, error-free texts. Second, they prefer shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE-guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
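As a rough illustration of how the two biases described above could be probed, the following sketch computes (1) the correlation between translation length and predicted error on translations known to be error-free, and (2) how often the top-scoring candidate for a source is also the shortest one. This is not the study's own procedure: the function names, the whitespace tokenization, and the input layout are all assumptions made for illustration, and the scores would come from whatever QE metric is being examined.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def token_count(text):
    """Length in whitespace-separated tokens (an assumed, simplistic length measure)."""
    return len(text.split())


def length_error_correlation(translations, predicted_errors):
    """Bias 1 probe (hypothetical): correlate length with predicted error.

    `translations` are assumed to be error-free, so a strongly positive
    correlation would suggest the metric over-predicts errors with length.
    """
    lengths = [token_count(t) for t in translations]
    return pearson(lengths, predicted_errors)


def shorter_preference_rate(candidate_groups):
    """Bias 2 probe (hypothetical): how often the best-scored candidate is the shortest.

    `candidate_groups` is a list of groups, one per source text; each group is a
    list of (translation, qe_score) pairs where a higher score means better quality.
    """
    hits = 0
    for group in candidate_groups:
        best_text, _ = max(group, key=lambda pair: pair[1])
        shortest = min(token_count(text) for text, _ in group)
        if token_count(best_text) == shortest:
            hits += 1
    return hits / len(candidate_groups)
```

A correlation near zero and a preference rate close to what chance would give for the group sizes would be consistent with little length bias under these simplified assumptions; comparing candidates of matched quality matters, since a shorter candidate can also be preferred for legitimate reasons.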