CL
Apr 3, 2025

Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

arXiv:2504.02674v1
11 citations
h-index: 26
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Originality: Synthesis-oriented
AI Analysis

This work addresses machine translation for a low-resource creole language and highlights the value of small-scale data collection for domain transfer. It is incremental, however, in that it applies existing methods to a new dataset.

The authors tackle machine translation for low-resource Guinea-Bissau Creole (Kiriol) by introducing a new dataset of about 40,000 parallel sentences. They find that adding just 300 sentences from the target domain substantially improves translation performance, and that Portuguese-to-Kiriol models perform best on average.

We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand sentences parallel with English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general-domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance of and need for data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.
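The abstract mentions the degree of lexical overlap between creoles and lexifiers without saying how it is measured. One simple way to approximate such a quantity is the Jaccard similarity between the word-type vocabularies of the two sides of a parallel corpus. The sketch below is only an illustration under that assumption and is not necessarily the measure the authors use. The function names `vocabulary` and `lexical_overlap` are hypothetical, and the example strings are invented placeholders, not sentences from the dataset.

```python
import re


def vocabulary(sentences):
    """Return the set of lowercased word types found in a list of sentences."""
    types = set()
    for sentence in sentences:
        # \w+ matches runs of Unicode letters, digits, and underscores.
        types.update(re.findall(r"\w+", sentence.lower()))
    return types


def lexical_overlap(side_a, side_b):
    """Jaccard similarity between the vocabularies of two sentence lists.

    Assumption: overlap is measured on exact surface forms, so spelling
    differences between related words count as non-overlap.
    """
    vocab_a, vocab_b = vocabulary(side_a), vocabulary(side_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Invented placeholder strings, NOT taken from the dataset.
lexifier_side = ["a casa grande", "o livro novo"]
creole_side = ["kasa grande", "livro nobu"]

overlap = lexical_overlap(lexifier_side, creole_side)
print(f"type-level lexical overlap: {overlap:.2f}")
```

Exact-match overlap of this kind understates relatedness when cognates differ only in spelling, so a character-level or edit-distance variant may be more informative for a creole and its lexifier.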

