CY AI | Apr 2

What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling

arXiv:2604.16413 | 89.1 | h-index: 2

AI Analysis

For computational social scientists using LLMs for annotation, this work highlights a methodological reliability issue and offers a framework to measure and mitigate it.

LLM-based annotation in social science suffers from instability under prompt variation. The paper introduces Inter-Prompt Reliability (IPR) and shows that interpretative tasks exhibit substantial stochastic variation, while knowledge-based tasks are more stable; majority voting across prompts improves reproducibility.

Large language models (LLMs) are increasingly used for annotation in computational social science (CSS), yet their methodological reliability under prompt variation remains unclear. This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts. Drawing on Inter-Rater Reliability, IPR is measured by the Pairwise Agreement Rate (PAR) and its distribution, which together capture both consistency and stochasticity in model behavior. We evaluate this framework on two tasks with distinct properties: TREC (interpretative) and PolitiFact (knowledge-anchored). Results show that LLM annotation exhibits substantial stochastic variation in interpretative tasks, while appearing more stable in knowledge-based tasks. We further show that majority voting across prompts significantly improves reproducibility and reduces variance. These findings suggest that the LLM prompt acts as a measurement instrument and that its wording is a source of methodological uncertainty. For future LLM-based CSS studies, we suggest that researchers move beyond single-prompt evaluation toward distributional stability and prompt aggregation within our IPR framework.
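
To make the two ideas in the abstract concrete, here is a minimal Python sketch of a pairwise agreement rate and of majority voting across prompts. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formula for PAR, so the sketch assumes it is the fraction of items on which two prompts produce the same label. The function names, prompt names, and toy labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def pairwise_agreement_rates(labels_by_prompt):
    """Agreement rate for every unordered pair of prompts.

    Assumption: agreement for a pair is the share of items that
    receive the same label under both prompts.
    """
    rates = {}
    for (p1, l1), (p2, l2) in combinations(labels_by_prompt.items(), 2):
        if len(l1) != len(l2) or not l1:
            raise ValueError("each prompt needs labels for the same non-empty item set")
        matches = sum(a == b for a, b in zip(l1, l2))
        rates[(p1, p2)] = matches / len(l1)
    return rates


def majority_vote(labels_by_prompt):
    """Aggregate one label per item by majority across prompts.

    Ties go to the label seen first in prompt order (Counter behavior).
    """
    per_item = zip(*labels_by_prompt.values())
    return [Counter(item_labels).most_common(1)[0][0] for item_labels in per_item]


# Made-up labels for four items under three hypothetical prompt wordings.
labels = {
    "prompt_a": ["LOC", "NUM", "HUM", "LOC"],
    "prompt_b": ["LOC", "NUM", "ENTY", "LOC"],
    "prompt_c": ["LOC", "DESC", "HUM", "LOC"],
}

rates = pairwise_agreement_rates(labels)
print(rates)                      # per-pair agreement
print(mean(rates.values()))       # average agreement across pairs
print(pstdev(rates.values()))     # spread of the agreement distribution
print(majority_vote(labels))      # aggregated labels
```

Reporting the spread of the per-pair rates alongside their mean reflects the abstract's point that the distribution matters, not only the average. A chance-corrected statistic could replace raw agreement if label imbalance is a concern.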
