CL CY · Aug 6, 2025

Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

Kezia Oketch, John P. Lalor, Ahmed Abbasi

arXiv:2508.14051v1 · 2.71 citations · h-index: 5

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of culturally biased NLP evaluation for Swahili speakers, though its contribution is incremental because it focuses on a specific domain and dataset.

The paper tackles the lack of sociolinguistic diversity in Swahili NLP by collecting a dataset of 2,170 free-text responses from Kenyan speakers and using a taxonomy to analyze model errors, revealing how tribal influences and code-mixing affect performance.

We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.
