CL, AI · Jun 30, 2025

Natural language processing for African languages

arXiv:2507.00297v1 · 6 citations · h-index: 31

Originality: Incremental advance

AI Analysis

The work addresses low-resource language processing for African languages. It is an incremental advance: it builds on existing multilingual methods while contributing novel applications and datasets.

This dissertation tackles the under-representation of African languages in NLP by analyzing noise in corpora, curating high-quality data, and developing labeled datasets for 21 languages, showing that data quality matters more than quantity for word embeddings and that multilingual PLMs can be adapted with minimal monolingual text.

Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and the lack of labelled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa, where all indigenous languages can be regarded as low-resourced in terms of both the labelled data available for NLP tasks and the unlabelled data found on the web. We analyse the noise in publicly available corpora and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings depends not only on the amount of pre-training data but also on its quality. We demonstrate empirically the limitations of word embeddings and the opportunities that multilingual pre-trained language models (PLMs) offer, especially for languages unseen during pre-training and for low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual text. To address the under-representation of African languages in NLP research, we developed large-scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.
