CL · AI · Dec 21, 2022

SERENGETI: Massively Multilingual Language Models for Africa

arXiv:2212.10785v2241 citations · h-index: 19 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the limited availability of language models for African languages, which are underrepresented in NLP, and reports a strong gain specific to this domain.

The paper tackles the limited coverage of African languages in existing multilingual language models by developing SERENGETI, which covers 517 African languages and outperforms other models on 11 out of 20 datasets, achieving an average F1 score of 82.27.

Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F1. We also perform analyses of errors from our models, which allow us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research (https://github.com/UBC-NLP/serengeti).

