Building a Functional Machine Translation Corpus for Kpelle
This work addresses the lack of language technology for Kpelle, a low-resource language. Its contribution is incremental, since it applies an existing method to new data.
The authors tackle the lack of machine translation resources for Kpelle by creating the first publicly available English-Kpelle dataset, with over 2,000 sentence pairs, and report BLEU scores of up to 30 in the Kpelle-to-English direction after fine-tuning an existing model with data augmentation.
In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2,000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resource Mande languages.
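For readers unfamiliar with the metric behind the reported scores, the following is a minimal sketch of corpus-level BLEU on a 0-100 scale. It is not the evaluation setup used in this work, which is not specified here. The function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch assumes whitespace tokenization, a single reference per hypothesis, uniform weights over 1- to 4-grams, and no smoothing. The sentences in the usage example are English placeholders, not dataset entries.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not from the paper): whitespace tokenization, one
    reference per hypothesis, uniform n-gram weights, no smoothing.
    """
    matched = [0] * max_n   # clipped n-gram matches, per n-gram order
    total = [0] * max_n     # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions (in log space).
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Placeholder English sentences, for illustration only.
example_hyps = ["the cat sat on the mat today"]
example_refs = ["the cat sat on the mat"]
example_score = corpus_bleu(example_hyps, example_refs)
```

Scores from a simplified implementation like this one are not directly comparable to published numbers, because tokenization and smoothing choices can shift BLEU noticeably. A standardized tool would be the safer choice for reporting results.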