CL · Jun 6, 2024

How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?

arXiv:2406.03893v1 · 27 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of reliably evaluating machine translation for low-resource languages, which is crucial for improving translation systems in under-resourced contexts. The findings are incremental, however: they highlight existing gaps rather than close them.

The paper evaluates machine translation metrics for low-resource Indian languages in a zero-shot setting. It finds that automatic metrics reach correlations with human annotations of at most 0.32 (Kendall Tau) and 0.45 (Pearson), indicating poor agreement with human judgments.

While machine translation evaluation has been studied primarily for high-resource languages, there has been recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we study a zero-shot evaluation setting for four low-resource Indian languages: Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-Dimensional Quality Metrics (MQM) and Direct Assessment (DA) annotations to create test sets and meta-evaluate a wide range of automatic evaluation metrics. We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45, respectively. Synthetic data approaches show mixed results and overall do not help close the gap by much for these languages. This indicates that there is still a long way to go for low-resource evaluation.
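To make the meta-evaluation concrete, here is a minimal sketch of how a metric's segment-level scores can be correlated with human scores. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `pearson` and `kendall_tau_a`, the plain tau-a variant (ties count as neither concordant nor discordant), and the toy scores are all hypothetical, and the paper may use a different Kendall Tau variant.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both scores order this pair the same way
        elif product < 0:
            discordant += 1   # the two scores disagree on the order
        # product == 0 is a tie and counts as neither
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy data (NOT from the paper): one score per translated segment.
human_scores = [90.0, 75.0, 60.0, 40.0, 20.0]    # e.g. DA-style human ratings
metric_scores = [0.80, 0.85, 0.55, 0.60, 0.30]   # e.g. an automatic metric's output

tau = kendall_tau_a(metric_scores, human_scores)
r = pearson(metric_scores, human_scores)
print(f"Kendall tau-a: {tau:.2f}, Pearson r: {r:.2f}")
```

Kendall Tau only looks at whether the metric ranks pairs of segments in the same order as the annotators, while Pearson measures linear agreement between the raw scores, which is why the two are usually reported together.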

Code Implementations: 1 repo
