CL, AI, Apr 12, 2021

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique

arXiv:2104.05753v1, 8 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of NLP resources for speakers and researchers of Emakhuwa, a Bantu language of Mozambique. The contribution is incremental, since the corpus is assembled from existing data sources.

The paper tackles the lack of parallel corpora for low-resource languages by creating an Emakhuwa-Portuguese parallel corpus, resulting in a dataset of 47,415 sentence pairs with 699,976 Emakhuwa and 877,595 Portuguese word tokens.

Major advances in the performance of machine translation models have been made possible in part by the availability of large-scale parallel corpora. For most of the world's languages, however, such corpora are rare. Emakhuwa, a language spoken in Mozambique, is, like most African languages, low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora that include Emakhuwa exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, a collection of texts from the Jehovah's Witness website and a variety of other sources, including the African Story Book website, the Universal Declaration of Human Rights and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens in Emakhuwa and 877,595 word tokens in Portuguese. Once the remaining normalization steps are completed, the corpus will be made freely available for research use.
