CLAISep 24, 2025

Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks

arXiv:2509.20209v1h-index: 3Has Code2025 3rd International Conference on Foundation and Large Language Models (FLLM)
Originality Incremental advance
AI Analysis

This work addresses the problem of underserved low-resource languages like Tigrinya for NLP researchers and practitioners, though it is incremental in applying existing techniques to a specific language.

The paper tackled low-resource English-Tigrinya machine translation by using transfer learning with custom tokenizers and a new evaluation dataset, achieving substantial gains over baselines as validated by BLEU, chrF, and human evaluation.

Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes