CL | Feb 17, 2023

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

arXiv:2302.08956v5 | 172 citations | h-index: 56
Originality: Synthesis-oriented
AI Analysis

This paper addresses the shortage of annotated datasets available to researchers working on African languages, though its contribution is incremental because it builds on existing sentiment analysis benchmarks.

The authors tackled the lack of NLP resources for African languages by introducing AfriSenti, a sentiment analysis benchmark with over 110,000 tweets in 14 languages, which was used in a shared task with over 200 participants.

Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task, which had over 200 participants (see the website at https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the different datasets and discuss their usefulness.

Code Implementations: 3 repos
