AI · Jun 2, 2025

The State of Large Language Models for African Languages: Progress and Challenges

Kedir Yassin Hussen, Walelign Tewabe Sewunetie, Abinew Ali Ayele, Sukairaj Hafiz Imam, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam

arXiv:2506.02280v3 · 14.7 · 10 citations · h-index: 19

Originality: Synthesis-oriented

AI Analysis

The paper highlights a critical problem: speakers of low-resource African languages are largely excluded from the benefits of NLP advances. Its contribution is incremental, since it reviews existing work and identifies gaps rather than proposing new solutions.

This paper analyzes the coverage of African languages in large language models and finds that only 42 of approximately 2,000 languages are supported, leaving over 98% unsupported. Only three scripts (Latin, Arabic, and Ge'ez) are covered, while 20 active scripts are neglected.

Large Language Models (LLMs) are transforming Natural Language Processing (NLP), but their benefits are largely absent for Africa's roughly 2,000 low-resource languages. This paper comparatively analyzes African language coverage across six LLMs, eight Small Language Models (SLMs), and six Specialized SLMs (SSLMs). The evaluation covers language coverage, training sets, technical limitations, script problems, and language modelling roadmaps. The work identifies 42 supported African languages and 23 publicly available datasets, and it reveals a large gap: four languages (Amharic, Swahili, Afrikaans, and Malagasy) are consistently covered, while over 98% of African languages remain unsupported. Moreover, the review shows that only the Latin, Arabic, and Ge'ez scripts are covered, while 20 active scripts are neglected. The primary challenges include data scarcity, tokenization biases, high computational costs, and evaluation issues. Addressing them calls for language standardization, community-driven corpus development, and effective adaptation methods for African languages.
