CL · Apr 7

YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani

arXiv:2604.05624 · 24.7 · h-index: 33

Predicted impact: top 31% in CL · last 90 days · Originality: Synthesis-oriented

AI Analysis

This paper addresses the shortage of NLP resources for Yorùbá, a low-resource language, by providing a new dataset and model. The contribution is incremental, since it builds on existing NER frameworks.

The authors tackled the lack of diverse datasets for Yorùbá Named Entity Recognition by creating YoNER, a multi-domain dataset with about 5,000 sentences and 100,000 tokens from five domains, and introduced OyoBERT, a Yorùbá-specific language model that outperforms multilingual models in in-domain evaluation.

Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
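To make the CoNLL-style annotation with PER, ORG and LOC labels concrete, the following is a minimal sketch of how such data could be read and turned into entity spans. It assumes a plain-text layout with one whitespace-separated token and BIO tag per line and a blank line between sentences. The actual YoNER file layout is not described in the abstract, so that layout is an assumption. The helper names `read_conll` and `extract_entities` and the sample sentence are hypothetical and are not taken from the dataset.

```python
from typing import List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, BIO tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text: 'token tag' per line, blank line between sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is assumed to be the last whitespace-separated field.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity text, entity type) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open span, then open a new one
            # (a stray I- tag is treated leniently as a span start).
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:
            # An 'O' tag closes any open span.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Illustrative sentence only; NOT drawn from YoNER.
SAMPLE = """\
Wọlé B-PER
Ṣóyínká I-PER
lọ O
sí O
Ìbàdàn B-LOC
. O
"""

if __name__ == "__main__":
    for sent in read_conll(SAMPLE):
        print(extract_entities(sent))
```

Span-level output like this is what entity-level precision, recall and F1 are typically computed over, which is one reason a span extractor is a common first step when working with BIO-tagged NER data.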

