CL, Feb 19

Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya

arXiv:2602.17425v11.11 citations, h-index: 16

Originality: Synthesis-oriented

AI Analysis

The paper addresses evaluation challenges for machine translation in extremely low-resource languages. The contribution is incremental, as it builds on existing metrics.

This study compared BLEU and ChrF++ metrics for evaluating machine translation in extremely low-resource languages like Magahi, Bhojpuri, and Chhattisgarhi, finding that BLEU offers complementary lexical-precision insights despite lower scores.

Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
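To make the contrast between word-level and character-level scoring concrete, here is a minimal sketch in Python. It is not the paper's evaluation code and not the official BLEU or ChrF++ implementations: `word_unigram_precision` stands in for BLEU's lexical-precision component (real BLEU combines 1- to 4-gram precisions with a brevity penalty), and `char_fscore` stands in for the character n-gram part of ChrF++ (which also adds word unigrams and bigrams and uses longer character n-grams). The function names, the `max_n=3` setting, and the toy English sentences are all assumptions for illustration. A one-character spelling difference plays the role of a matra variation.

```python
from collections import Counter


def ngrams(units, n):
    """Count the n-grams of a sequence (list of words or characters)."""
    return Counter(tuple(units[i:i + n]) for i in range(len(units) - n + 1))


def precision_recall(hyp_units, ref_units, n):
    """Clipped n-gram precision and recall of a hypothesis against one reference."""
    hyp, ref = ngrams(hyp_units, n), ngrams(ref_units, n)
    matched = sum((hyp & ref).values())  # multiset intersection clips counts
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


def word_unigram_precision(hyp, ref):
    """BLEU-like lexical precision: a word either matches exactly or not at all."""
    return precision_recall(hyp.split(), ref.split(), 1)[0]


def char_fscore(hyp, ref, max_n=3, beta=2.0):
    """chrF-like score: F-beta over character n-gram precision and recall,
    averaged over n = 1..max_n, with whitespace removed (simplified)."""
    hyp_chars = list(hyp.replace(" ", ""))
    ref_chars = list(ref.replace(" ", ""))
    pairs = [precision_recall(hyp_chars, ref_chars, n) for n in range(1, max_n + 1)]
    precision = sum(p for p, _ in pairs) / len(pairs)
    recall = sum(r for _, r in pairs) / len(pairs)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Hypothetical toy pair: the hypothesis differs from the reference by one character.
reference = "the cat sat on the mat"
hypothesis = "the cat sat on the mut"

word_score = word_unigram_precision(hypothesis, reference)
char_score = char_fscore(hypothesis, reference)
print(f"word unigram precision: {word_score:.3f}")
print(f"character F-score:      {char_score:.3f}")
```

In this toy case the word-level score penalizes the whole mismatched word, while the character-level score still credits the characters that agree. That is one way to see why the two metrics can report different things about the same output, and why a lower word-level score can carry separate information about exact lexical matches.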
