Long-context Reference-based MT Quality Estimation
This work addresses translation quality evaluation for machine translation researchers and practitioners. Its contribution is incremental, as it builds on existing frameworks and datasets.
The paper tackles machine translation quality estimation with systems based on the COMET framework, trained on augmented long-context data to predict segment-level Error Span Annotation scores. It finds that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.
In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level Error Span Annotation (ESA) scores using augmented long-context data. To construct long-context training data, we concatenate in-domain, human-annotated sentences and compute a weighted average of their scores. We integrate multiple human judgment datasets (MQM, SQM, and DA) by normalising their scales and train multilingual regression models to predict quality scores from the source, hypothesis, and reference translations. Experimental results show that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.
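The data construction described above can be illustrated with a minimal sketch. The abstract does not state which weights are used in the weighted average or how the MQM, SQM, and DA scales are normalised, so the choices here are assumptions made only for illustration: weights proportional to the character length of each hypothesis, and min-max rescaling to [0, 1]. The names `Segment`, `rescale`, and `concat_segments` are hypothetical and are not part of COMET.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    src: str      # source text
    mt: str       # hypothesis (machine translation)
    ref: str      # reference translation
    score: float  # human score, assumed already on a common scale


def rescale(score: float, low: float, high: float) -> float:
    """Min-max map a raw score from [low, high] to [0, 1].

    Assumption: the paper only says the scales are normalised;
    min-max rescaling is one possible choice.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low)


def concat_segments(segments: Sequence[Segment], sep: str = " ") -> Segment:
    """Join annotated segments into one long-context training example.

    Assumption: each segment's score is weighted by the character
    length of its hypothesis; the paper does not specify the weights.
    """
    if not segments:
        raise ValueError("need at least one segment")
    # Guard against empty hypotheses so the weights never sum to zero.
    weights = [max(len(s.mt), 1) for s in segments]
    total = sum(weights)
    score = sum(w * s.score for w, s in zip(weights, segments)) / total
    return Segment(
        src=sep.join(s.src for s in segments),
        mt=sep.join(s.mt for s in segments),
        ref=sep.join(s.ref for s in segments),
        score=score,
    )
```

For example, concatenating a short segment scored 1.0 with a hypothesis three times as long scored 0.0 yields a combined score of 0.25 under this weighting, so longer segments dominate the long-context label.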