Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora
For NLP practitioners creating sentiment corpora, especially in low-resource languages, this work identifies temporal simultaneity as a key factor in annotation quality. The findings are incremental, however, because they confirm known issues of annotator drift.
The authors present a Setswana sentiment dataset and find that inter-annotator agreement declines over time. Temporal simultaneity (annotations made within one minute of each other) is the dominant predictor of high agreement (κ=0.98 vs. 0.65 for annotations made more than a day apart). They also benchmark models, with GPT-5 few-shot achieving 62.2 macro-F1.
Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Although the aggregate Randolph's free-marginal kappa is $\kappa = 0.76$, nominally "excellent," per-batch $\kappa$ falls by more than 32 points over the course of the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of $\kappa$ is temporal simultaneity: tweets labeled within one minute achieve $\kappa = 0.98$, while those labeled more than a day apart reach only $\kappa = 0.65$. Annotation speed and tweet-level linguistic features show no meaningful association with $\kappa$. We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
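To make the temporal-simultaneity analysis concrete, the following is a minimal sketch of how Randolph's free-marginal kappa could be computed per time-spread bucket. It is not the authors' released analysis code. The function names, the record layout (a list of labels paired with a list of timestamps per tweet), and the intermediate "within one day" bucket are assumptions made for illustration; the abstract only names the one-minute and more-than-a-day conditions. The kappa itself follows the standard free-marginal definition, (P_o - 1/k) / (1 - 1/k), where P_o is the mean proportion of agreeing rater pairs per item and k is the number of categories.

```python
from collections import Counter
from datetime import datetime, timedelta


def free_marginal_kappa(items, num_categories):
    """Randolph's free-marginal multirater kappa.

    items: list of label lists, one per tweet, each holding >= 2 ratings.
    num_categories: number of possible labels (3 for pos/neg/neutral).
    """
    if not items:
        raise ValueError("need at least one item")
    agreement_sum = 0.0
    for labels in items:
        n = len(labels)
        counts = Counter(labels)
        # Ordered rater pairs that agree, over all ordered rater pairs.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n * (n - 1))
    observed = agreement_sum / len(items)
    expected = 1.0 / num_categories  # chance agreement with free marginals
    return (observed - expected) / (1.0 - expected)


def time_spread_bucket(timestamps):
    """Bucket a tweet by the gap between its earliest and latest annotation.

    The middle bucket is a hypothetical choice, not taken from the paper.
    """
    spread = max(timestamps) - min(timestamps)
    if spread <= timedelta(minutes=1):
        return "within_1_min"
    if spread <= timedelta(days=1):
        return "within_1_day"
    return "over_1_day"


def kappa_by_bucket(records, num_categories=3):
    """records: iterable of (labels, timestamps) pairs, one per tweet."""
    buckets = {}
    for labels, timestamps in records:
        buckets.setdefault(time_spread_bucket(timestamps), []).append(labels)
    return {
        name: free_marginal_kappa(items, num_categories)
        for name, items in buckets.items()
    }
```

Calling `kappa_by_bucket` on the released per-annotation timestamps would yield one kappa value per bucket, which is the kind of comparison the abstract reports. The free-marginal variant suits this setting because annotators are not constrained to assign a fixed number of tweets to each class.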