CL · Apr 30, 2024

Does Generative AI speak Nigerian-Pidgin?: Issues about Representativeness and Bias for Multilingualism in LLMs

arXiv:2404.19442v5 · 12 citations · h-index: 31 · NAACL
Originality: Incremental advance
AI Analysis

This paper addresses bias and representativeness issues in multilingual AI for the approximately 120 million speakers of Naija, offering incremental insights into linguistic diversity.

The paper tackles the underrepresentation of Nigerian Pidgin (Naija) in generative AI. Through statistical analyses and machine translation experiments, it shows that AI models operate primarily on West African Pidgin English (WAPE) and fail to represent Naija. Differences in word order and vocabulary between the two varieties make it hard to teach LLMs Naija from only a few examples.

Nigeria is a multilingual country with 500+ languages. Naija is a Nigerian Pidgin spoken by approximately 120M people; it is a mixed language drawing on, e.g., English, Portuguese, Yoruba, Hausa and Igbo. Although it has mainly been a spoken language until recently, some online platforms (e.g., Wikipedia) now publish in written Naija as well. West African Pidgin English (WAPE) is also spoken in Nigeria, and the BBC uses it to broadcast news on the internet to a wider audience, not only in Nigeria but also in other West African countries (e.g., Cameroon and Ghana). Through statistical analyses and Machine Translation experiments, our paper shows that these two pidgin varieties do not represent each other (i.e., there are linguistic differences in word order and vocabulary) and that Generative AI operates only on the basis of WAPE. In other words, Naija is underrepresented in Generative AI, and it is hard to teach it to LLMs with few examples. In addition to the statistical analyses, we also provide historical information on both pidgins as well as insights from interviews conducted with volunteer Wikipedia contributors in Naija.

Code Implementations: 1 repo