CL · May 26

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora

Idris Abdulmumin, Mokgadi Penelope Matloga, Tadesse Destaw Belay, Botshelo Kondowe, Letlhogonolo Mohleleng, Hareaipha Nkopo Letsoalo, Shamsuddeen Hassan Muhammad, Vukosi Marivate

arXiv:2605.27239 · 29.9

AI Analysis

For NLP practitioners building sentiment corpora, especially in low-resource languages, this work identifies temporal simultaneity as a key predictor of annotation quality. The findings are incremental, however, since they largely confirm known issues of annotator drift.

The authors present a Setswana sentiment dataset and find that inter-annotator agreement declines over the course of the annotation campaign. Temporal simultaneity is the dominant predictor of agreement: tweets whose annotations were made within one minute reach κ = 0.98, compared with κ = 0.65 for those annotated more than a day apart. They also benchmark models, with GPT-5 few-shot scoring highest at 62.2 macro-F1.

Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph's free-marginal kappa of $\kappa = 0.76$ ("excellent"), per-batch $\kappa$ falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $\kappa$ is temporal simultaneity: tweets labeled within one minute achieve $\kappa = 0.98$, while those labeled more than a day apart reach only $\kappa = 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $\kappa$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
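
To make the agreement statistic concrete, here is a minimal Python sketch of Randolph's free-marginal multirater kappa, together with a simple split of items by how far apart their annotations were made. It is an illustration, not the authors' released analysis code. The function names, the record layout (a list of labels plus a list of timestamps in seconds per tweet), and the definition of "time spread" as the gap between the earliest and latest annotation of a tweet are all assumptions made for this example.

```python
from collections import Counter


def free_marginal_kappa(items, num_categories):
    """Randolph's free-marginal multirater kappa.

    items: list of label lists, one per item; every item must have the
           same number of raters (at least 2).
    num_categories: number of possible labels (3 for neg/neu/pos).
    """
    n = len(items[0])  # raters per item
    total = 0.0
    for labels in items:
        counts = Counter(labels)
        # Fraction of agreeing rater pairs for this item.
        total += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    p_observed = total / len(items)
    p_expected = 1.0 / num_categories  # chance agreement under free marginals
    return (p_observed - p_expected) / (1.0 - p_expected)


def kappa_by_time_spread(records, num_categories, threshold_seconds=60):
    """Compute kappa separately for items annotated close together or far apart.

    records: list of (labels, timestamps) pairs, timestamps in seconds.
    The spread definition (max minus min timestamp) is an assumption
    for illustration, not necessarily the paper's definition.
    """
    within, beyond = [], []
    for labels, timestamps in records:
        spread = max(timestamps) - min(timestamps)
        (within if spread <= threshold_seconds else beyond).append(labels)
    return {
        "within": free_marginal_kappa(within, num_categories) if within else None,
        "beyond": free_marginal_kappa(beyond, num_categories) if beyond else None,
    }
```

With three annotators and three classes, full agreement on every tweet gives a kappa of 1, while a tweet on which only two of three annotators agree contributes an observed agreement of 1/3, which equals chance and pushes kappa toward 0.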
