CL · Dec 20, 2022

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

Microsoft
arXiv:2212.10180v2 · 235 citations · h-index: 41
Originality: Synthesis-oriented
AI Analysis

This work addresses the under-evaluation of machine translation metrics for Indian languages, which together have over a billion speakers. Its contribution is incremental: it primarily provides a dataset and analysis and does not introduce new methods.

The authors address the lack of systematic evaluation of machine translation metrics for Indian languages by creating an MQM dataset with 7000 annotations across 5 languages and 7 systems. They find that pre-trained metrics such as COMET correlate most strongly with human scores but still do not adequately capture fluency-based errors.

The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
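The meta-evaluation step described in the abstract, correlating annotator scores with automatic metric scores, can be illustrated with a minimal sketch. The abstract does not say which correlation statistics are used, so Pearson and Kendall tau here are assumptions chosen because they are common in metric meta-evaluation. All scores and the names `metric_a` and `metric_b` are made up for illustration and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Positive product: both score lists order the pair the same way.
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-segment scores (higher = better translation).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]      # e.g. normalized annotator scores
metric_a = [0.85, 0.50, 0.65, 0.30, 0.60]     # a metric that tracks the annotators
metric_b = [0.40, 0.45, 0.30, 0.50, 0.35]     # a metric that mostly disagrees

for name, scores in [("metric_a", metric_a), ("metric_b", metric_b)]:
    print(name, pearson(human_scores, scores), kendall_tau(human_scores, scores))
```

A metric whose scores rank segments the same way as the annotators gets a Kendall tau near 1, while a metric that tends to reverse the ranking gets a negative value. Comparing these correlations across metrics is what allows a statement such as "pre-trained metrics have the highest correlations with annotator scores".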

Code Implementations: 1 repo
