CL | Jan 8, 2025

When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages

arXiv:2501.04473v123 citations | h-index: 30 | COLING Workshops
AI Analysis

The paper addresses a critical challenge in machine translation evaluation for low-resource languages. Its contribution is incremental: the results expose limitations of current LLMs rather than deliver a breakthrough.

This paper tackles the problem of evaluating machine translation quality for low-resource languages without reference translations. It finds that prompt-based large language models underperform fine-tuned encoder models, and its error analysis highlights tokenization and transliteration issues.

This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero-shot and few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We publicly release the data and trained models for further research.
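
To make the task concrete, here is a minimal sketch of what zero-shot, segment-level QE prompting could look like. It is an illustration only, not the paper's prompt or code: the prompt wording and the helper names `build_prompt` and `parse_score` are assumptions, and the call to an actual LLM is left out.

```python
import re
from typing import Optional


def build_prompt(source: str, translation: str, src_lang: str, tgt_lang: str) -> str:
    """Build a zero-shot, reference-less QE prompt (hypothetical wording)."""
    return (
        f"Rate the quality of the following {src_lang}-to-{tgt_lang} translation "
        "on a scale from 0 to 100, where 0 is unusable and 100 is perfect. "
        "No reference translation is available.\n"
        f"Source: {source}\n"
        f"Translation: {translation}\n"
        "Score:"
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in a model reply, or None if absent or outside 0-100."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None
```

In a setup like this, the parsed scores would typically be compared against human quality judgments with a correlation measure. Replies that contain no usable number are returned as `None` so they can be counted separately instead of being silently scored.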

