CL, CY · Aug 6, 2025

Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

arXiv:2508.14051v1 · 1 citation · h-index: 5
Originality: Synthesis-oriented
AI Analysis

This paper addresses culturally biased NLP evaluation for Swahili speakers, though its contribution is incremental because it focuses on a single domain and dataset.

The paper tackles the lack of sociolinguistic diversity in Swahili NLP by collecting a dataset of 2,170 free-text responses from Kenyan speakers and using a taxonomy to analyze model prediction errors, showing how tribal influences and code-mixing affect performance.

We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
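As a rough illustration of what using a taxonomy "as a lens" on prediction errors could look like, the sketch below groups model errors by sociolinguistic tag and reports an error rate per tag. It is not the authors' pipeline: the record layout, the `error_rate_by_tag` helper, and the toy data are all assumptions made for this example, and only the four tag names come from the abstract.

```python
from collections import defaultdict

# Hypothetical toy records (invented for illustration, not from the paper).
# Each response carries zero or more taxonomy tags and a flag saying
# whether the model's prediction for it was correct.
records = [
    {"tags": {"code-mixing"}, "correct": False},
    {"tags": {"code-mixing", "loanwords"}, "correct": True},
    {"tags": {"urban vernacular"}, "correct": False},
    {"tags": {"tribal influences"}, "correct": True},
    {"tags": set(), "correct": True},
]


def error_rate_by_tag(records):
    """Return {tag: fraction of responses with that tag that were mispredicted}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        # Responses with no taxonomy tag are pooled under "untagged".
        for tag in record["tags"] or {"untagged"}:
            totals[tag] += 1
            if not record["correct"]:
                errors[tag] += 1
    return {tag: errors[tag] / totals[tag] for tag in totals}


rates = error_rate_by_tag(records)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.2f}")
```

A response with several tags counts toward each of them, so the per-tag rates are not mutually exclusive; a real analysis would also need enough examples per tag for the rates to be meaningful.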

