CL · LG · Oct 21, 2022

AfroLID: A Neural Language Identification Tool for African Languages

arXiv:2210.11744v3 · 306 citations · h-index: 19
Originality: Incremental advance
AI Analysis

AfroLID addresses the limited language identification coverage for African languages, a gap that matters for NLP applications such as web data mining. The advance is incremental, since it extends existing methods to a new domain.

The authors tackle language identification for under-served African languages by introducing AfroLID, a neural toolkit covering 517 languages and varieties. It achieves a 95.89 F1-score on a blind test set and outperforms five existing tools on most languages.

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to both showcase AfroLID's powerful capabilities and limitations.

Code Implementations: 1 repo
