CL LG Feb 4, 2024

Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity

Eric Khiu, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Armando Parra Flores, Leandro Acros Roman, A. Seza Doğruöz, En-Shiun Annie Lee

arXiv:2402.02633v127.6110 citations h-index: 14

Originality: Incremental advance

AI Analysis

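This work addresses the cost and difficulty of fine-tuning multilingual models for low-resource languages. It is an incremental advance: it builds on existing performance-prediction methods and extends them to two aspects that earlier work overlooked, namely low-resource languages and shifts across domains.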

The study tackled the challenge of predicting machine translation performance for low-resource languages by analyzing factors like fine-tuning corpus size, domain similarity, and language similarity, finding that domain similarity has the most critical impact.

Fine-tuning and testing a multilingual large language model is expensive and challenging for low-resource languages (LRLs). While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors: the size of the fine-tuning corpus, the domain similarity between fine-tuning and testing corpora, and the language similarity between source and target languages. We employ classical regression models to assess how these factors impact the model's performance. Our results indicate that domain similarity has the most critical impact on predicting the performance of Machine Translation models.
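
The regression setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the feature definitions, the evaluation metric, or which regressors were used, so the feature names (`log_size`, `domain_sim`, `lang_sim`), the ordinary least squares model, and the data are hypothetical. The data is synthetic, and the generating weights are chosen so that domain similarity dominates by construction; the sketch shows the mechanics of the approach, not a reproduction of the paper's result.

```python
import numpy as np


def fit_linear_regression(features, scores):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, scores, rcond=None)
    return beta[0], beta[1:]


def predict(intercept, coefs, features):
    """Predicted scores for a feature matrix."""
    return intercept + features @ coefs


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted scores."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic, made-up experiments (NOT data from the paper).
rng = np.random.default_rng(0)
n = 200
log_size = rng.uniform(3, 6, n)    # hypothetical: log10 of fine-tuning corpus size
domain_sim = rng.uniform(0, 1, n)  # hypothetical: fine-tuning/test domain similarity
lang_sim = rng.uniform(0, 1, n)    # hypothetical: source/target language similarity
features = np.column_stack([log_size, domain_sim, lang_sim])

# Arbitrary generating weights for the demo; domain similarity dominates by design.
scores = 5 + 2 * log_size + 20 * domain_sim + 3 * lang_sim + rng.normal(0, 1, n)

# Hold out the last 50 experiments, and standardize using training statistics
# only, so that the fitted coefficients are comparable across features.
train, test = slice(0, 150), slice(150, None)
mu = features[train].mean(axis=0)
sd = features[train].std(axis=0)
z = (features - mu) / sd

intercept, coefs = fit_linear_regression(z[train], scores[train])
error = rmse(scores[test], predict(intercept, coefs, z[test]))

names = ["log_size", "domain_sim", "lang_sim"]
most_influential = names[int(np.argmax(np.abs(coefs)))]
print(f"held-out RMSE: {error:.2f}")
print("standardized coefficients:", dict(zip(names, np.round(coefs, 2))))
print("largest standardized coefficient:", most_influential)
```

Comparing standardized coefficients is only one way to rank factors. Refitting the model with each feature removed and comparing held-out error is a common alternative, and it is less sensitive to correlated features.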
