CL | Mar 7, 2023

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab, Andani Madodonga, Matimba Shingange, Daniel Njini, Vukosi Marivate

arXiv:2303.03750v2 | 26.9230 citations | h-index: 16 | Has Code

Originality: Synthesis-oriented

AI Analysis

The work addresses the need for multilingual NLP resources in South African languages and enables research on government communication. Its contribution is incremental, since it applies existing methods to new data.

The paper introduces two multilingual corpora of South African government texts, covering 11 official languages, and provides Neural Machine Translation benchmarks for 9 indigenous languages by fine-tuning a pre-trained model.

This paper introduces two multilingual, government-themed corpora in various South African languages. The corpora were collected by gathering the South African government newspaper (Vuk'uzenzele), as well as South African government speeches (ZA-gov-multilingual), both of which are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. They were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we highlight the process of gathering, cleaning, and making available the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
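The sentence-alignment step mentioned in the abstract can be illustrated with a minimal sketch. It assumes sentence embeddings have already been computed by a multilingual encoder such as LASER and are passed in as plain arrays. Pairing by highest cosine similarity with a cut-off is an assumption made here for illustration, not necessarily the authors' procedure; the function name `align_sentences` and the default threshold of 0.7 are hypothetical.

```python
import numpy as np


def align_sentences(src_emb, tgt_emb, threshold=0.7):
    """Pair each source sentence with its most similar target sentence.

    src_emb, tgt_emb: 2-D arrays, one non-zero embedding row per sentence,
    assumed to come from a multilingual encoder such as LASER.
    threshold: hypothetical minimum cosine similarity for keeping a pair.
    Returns a list of (src_index, tgt_index, cosine_similarity) tuples.
    """
    src = np.asarray(src_emb, dtype=float)
    tgt = np.asarray(tgt_emb, dtype=float)

    # L2-normalise the rows so that a dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

    # sim[i, j] is the cosine similarity of source i and target j.
    sim = src @ tgt.T
    best = sim.argmax(axis=1)

    pairs = []
    for i, j in enumerate(best):
        score = float(sim[i, j])
        # Keep only confident matches; weak matches are dropped.
        if score >= threshold:
            pairs.append((i, int(j), score))
    return pairs


# Toy 2-D vectors standing in for real sentence embeddings.
toy_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_tgt = [[0.0, 2.0], [3.0, 0.0], [-1.0, -1.0]]
toy_pairs = align_sentences(toy_src, toy_tgt, threshold=0.9)
```

With the toy vectors, the third source row has no target above the 0.9 cut-off and is left unaligned. A stricter threshold trades corpus size for alignment quality.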

