CLJan 9Code
Afri-MCQA: Multimodal Cultural Question Answering for African LanguagesAtnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva et al.
Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA)
CLMar 23, 2025
Detection of Somali-written Fake News and Toxic Messages on the Social Media Using Transformer-based Language ModelsMuhidin A. Mohamed, Shuab D. Ahmed, Yahye A. Isse et al.
The fact that everyone with a social media account can create and share content, and the increasing public reliance on social media platforms as a news and information source bring about significant challenges such as misinformation, fake news, harmful content, etc. Although human content moderation may be useful to an extent and used by these platforms to flag posted materials, the use of AI models provides a more sustainable, scalable, and effective way to mitigate these harmful contents. However, low-resourced languages such as the Somali language face limitations in AI automation, including scarce annotated training datasets and lack of language models tailored to their unique linguistic characteristics. This paper presents part of our ongoing research work to bridge some of these gaps for the Somali language. In particular, we created two human-annotated social-media-sourced Somali datasets for two downstream applications, fake news \& toxicity classification, and developed a transformer-based monolingual Somali language model (named SomBERTa) -- the first of its kind to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news and news topic classification datasets. Comparative evaluation analysis of the proposed model against related multilingual models (e.g., AfriBERTa, AfroXLMR, etc) demonstrated that SomBERTa consistently outperformed these comparators in both fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.