CL | AI | Feb 18

Are LLMs Ready to Replace Bangla Annotators?

arXiv:2602.16241v2 | h-index: 1
AI Analysis

This work addresses the reliability of LLMs for sensitive annotation tasks in low-resource languages, highlighting limitations that could impact dataset creation and deployment in identity-sensitive settings.

The study evaluated 17 large language models as zero-shot annotators for Bangla hate speech, finding that they exhibit annotator bias and instability, with smaller, task-aligned models often performing more consistently than larger ones.

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
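The abstract does not say how annotator instability was measured, so the following is only an illustrative sketch, not the paper's evaluation framework. It assumes that the same zero-shot annotator is run twice over the same items and that instability is summarized by two common statistics: Cohen's kappa between the runs, and the fraction of items whose label changes. The function names, the label strings, and the example runs are all hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which the two runs match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each run's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used one identical label for every item
    return (observed - expected) / (1 - expected)


def flip_rate(runs):
    """Fraction of items whose label is not identical across all runs."""
    items = list(zip(*runs))
    return sum(len(set(labels)) > 1 for labels in items) / len(items)


# Hypothetical labels from two runs of the same model on four items.
run_a = ["hate", "hate", "not", "not"]
run_b = ["hate", "not", "not", "not"]

kappa = cohen_kappa(run_a, run_b)
flips = flip_rate([run_a, run_b])
```

Under these assumptions, a low kappa or a high flip rate between runs of the same model would point to the kind of unstable judgments the abstract describes. The same two functions could also compare a model against human labels, although that measures agreement with annotators rather than self-consistency.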

