CL, Sep 29, 2025

Don't Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation

arXiv:2509.25546v1 | 1 citation | h-index: 10 | EMNLP

Originality Incremental advance

AI Analysis

This addresses the need for more robust meta-evaluation metrics in machine translation, though it appears incremental as it refines existing correlation-based approaches.

The paper tackles the problem of evaluating machine translation evaluation metrics by introducing Pairwise Difference Pearson (PDP), a segment-level meta-evaluation metric that uses pairwise differences instead of raw scores. Analysis on the WMT'24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than previous approaches.

This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson's $ρ$-based and Kendall's $τ$-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions, and it uses segment-wise pairwise differences to restrict Global Pearson to intra-segment score comparisons. Analysis on the WMT'24 shared task shows that PDP properly ranks sentinel evaluation metrics and aligns better with human error weightings than previous work. Noise injection analysis demonstrates PDP's robustness to random noise, segment bias, and system bias, while highlighting its sensitivity to extreme outliers.
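The idea of correlating pairwise differences instead of raw scores can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below is one plausible reading and not the authors' implementation: for every segment, take the score difference for each pair of systems, pool those differences over all segments, and compute a single Pearson correlation between metric differences and human differences. The function names, the input layout (one row per segment, one column per system), and the choice of unordered pairs `i < j` are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_difference_pearson(metric_scores, human_scores):
    """Hypothetical PDP sketch (assumed formulation, not from the paper).

    Both arguments are lists of rows: one row per segment, one score per
    system. Differences are taken only between systems within the same
    segment, then pooled across all segments before correlating.
    """
    metric_diffs, human_diffs = [], []
    for metric_row, human_row in zip(metric_scores, human_scores):
        for i, j in combinations(range(len(human_row)), 2):
            metric_diffs.append(metric_row[i] - metric_row[j])
            human_diffs.append(human_row[i] - human_row[j])
    return pearson(metric_diffs, human_diffs)


# Toy data: 2 segments x 3 systems of made-up human scores.
human = [[-1.0, -5.0, 0.0], [-10.0, -2.0, -4.0]]

# A metric that matches the human scores except for a per-segment offset.
segment_bias = [7.0, -3.0]
biased_metric = [[h + b for h in row] for row, b in zip(human, segment_bias)]

flat_human = [h for row in human for h in row]
flat_metric = [m for row in biased_metric for m in row]

print("Pearson on raw scores:", pearson(flat_metric, flat_human))
print("Pairwise-difference Pearson:", pairwise_difference_pearson(biased_metric, human))
```

In this toy case the per-segment offset cancels inside every within-segment difference, so the pairwise-difference correlation stays at essentially 1 while the correlation on raw scores drops. That is consistent with the robustness to segment bias described above, although the paper's actual definition and experiments may differ in detail.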
