CL · Jun 27, 2024

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

arXiv:2406.19564v1 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses resource disparities for Yorùbá dialect speakers and may have broader impact for other African languages, though it is incremental in that it builds on existing NLP efforts.

The authors address the lack of NLP resources for Yorùbá regional dialects by creating a high-quality parallel text and speech corpus across four dialects. Their experiments reveal substantial performance disparities relative to standard Yorùbá and show that dialect-adaptive finetuning can narrow this gap.

Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are few to no resources or tools. We take steps towards bridging this gap by introducing YORÙLECT, a new high-quality parallel text and speech corpus across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.

Code Implementations: 1 repo
