Evaluating how LLM annotations represent diverse views on contentious topics
This work addresses bias in automated data annotation, a concern for both researchers and practitioners, and highlights that model choice alone is insufficient for fairness. The contribution is incremental, building on prior bias research.
The study evaluated whether large language models (LLMs) disproportionately align with majority viewpoints when annotating contentious topics. It found no systematic, substantial disagreement with annotators on the basis of demographics; instead, multiple LLMs tend to be biased in the same directions on the same demographic categories within the same datasets. Human annotator disagreement (item difficulty) is a stronger predictor of LLM agreement than demographics.
Researchers have proposed using generative large language models (LLMs) to label data in research and applied settings. This literature emphasizes the improved performance of these models relative to other natural language models, noting that generative LLMs typically outperform other models and even humans across several metrics. Previous literature has examined bias across many applications and contexts, but less work has focused specifically on bias in generative LLMs' responses to subjective annotation tasks. Such bias could result in LLM-applied labels that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on contentious annotation tasks. Across four annotation tasks on four datasets, we find that LLMs do not exhibit systematic, substantial disagreement with annotators on the basis of demographics. Rather, multiple LLMs tend to be biased in the same directions on the same demographic categories within the same datasets. Moreover, the disagreement between human annotators on the labeling task, a measure of item difficulty, is far more predictive of LLM agreement with human annotators than annotator demographics are. We conclude with a discussion of the implications for researchers and practitioners using LLMs for automated data annotation tasks. Specifically, we emphasize that fairness evaluations must be contextual, that model choice alone will not solve potential issues of bias, and that item difficulty must be integrated into bias assessments.
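To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' method. It assumes human annotations arrive as hypothetical `(item_id, group, label)` tuples and that each item has a single LLM label. Item difficulty is approximated here by the Shannon entropy of the human labels, which is one of several possible disagreement measures and is an assumption of this sketch. All function and variable names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels; 0 means full agreement."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def agreement_by_item(human_rows, llm_labels):
    """Map each item to (human label entropy, share of humans the LLM agrees with).

    human_rows: iterable of (item_id, group, label) tuples (assumed layout).
    llm_labels: dict mapping item_id to the LLM's label for that item.
    """
    labels_per_item = defaultdict(list)
    for item_id, _group, label in human_rows:
        labels_per_item[item_id].append(label)

    result = {}
    for item_id, labels in labels_per_item.items():
        matches = sum(label == llm_labels[item_id] for label in labels)
        result[item_id] = (label_entropy(labels), matches / len(labels))
    return result


def agreement_by_group(human_rows, llm_labels):
    """Share of annotations in each demographic group that match the LLM label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item_id, group, label in human_rows:
        totals[group] += 1
        hits[group] += int(label == llm_labels[item_id])
    return {group: hits[group] / totals[group] for group in totals}
```

Comparing the per-item pairs against the per-group rates gives a rough picture of whether LLM agreement varies more with item difficulty or with annotator group. A real analysis would likely use a regression that controls for both at once, which this sketch does not attempt.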