CL · Oct 3, 2025

What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification

arXiv:2510.03541v1 · 4.91 citations · h-index: 2

Originality: Incremental advance

AI Analysis

The paper addresses a critical methodological issue for computational social science researchers: conceptualization, a step often overlooked in LLM-era classification, whose neglect can bias downstream inferences.

The paper tackles conceptualization errors in text classification with large language models (LLMs) in computational social science. Using simulations, it shows that these errors bias downstream estimates and cannot be fixed solely by improving LLM accuracy or by applying post-hoc bias correction methods.

Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting -- conceptualization of concepts to be classified and using LLM predictions in downstream statistical inference -- which we argue have been overlooked in much of LLM-era CSS. We claim LLMs can tempt analysts to skip the conceptualization step, creating conceptualization errors that bias downstream estimates. Using simulations, we show that this conceptualization-induced bias cannot be corrected for solely by increasing LLM accuracy or by applying post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice on how to pursue low-cost, unbiased, low-variance downstream estimates.
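The abstract's central claim can be illustrated with a toy simulation. The sketch below is not the paper's simulation design; every number and name in it is an assumption made for illustration. It assumes an intended concept with 30% prevalence, a misspecified codebook that additionally labels 20% of non-instances as positive, an LLM that reproduces the codebook label with a chosen accuracy, and a simple difference-style post-hoc correction that uses a gold subset annotated under the same (misspecified) codebook.

```python
import random


def simulate(n=20000, llm_accuracy=0.9, n_gold=2000, seed=0):
    """Toy illustration (all parameters are assumptions, not from the paper)."""
    rng = random.Random(seed)
    intended, codebook, llm = [], [], []
    for _ in range(n):
        # Label under the concept the analyst actually cares about.
        y_true = 1 if rng.random() < 0.30 else 0
        # Misspecified codebook: also marks 20% of non-instances as positive.
        over_inclusive = rng.random() < 0.20
        y_code = 1 if (y_true == 1 or over_inclusive) else 0
        # The LLM reproduces the codebook label with the given accuracy.
        y_llm = y_code if rng.random() < llm_accuracy else 1 - y_code
        intended.append(y_true)
        codebook.append(y_code)
        llm.append(y_llm)

    naive = sum(llm) / n
    # Post-hoc correction: estimate the LLM's average error on a gold subset.
    # The gold labels follow the same codebook, so the correction targets the
    # codebook concept, not the intended one.
    rectifier = sum(codebook[i] - llm[i] for i in range(n_gold)) / n_gold
    corrected = naive + rectifier
    return {
        "intended": sum(intended) / n,
        "codebook": sum(codebook) / n,
        "naive": naive,
        "corrected": corrected,
    }


if __name__ == "__main__":
    for acc in (0.8, 0.9, 1.0):
        r = simulate(llm_accuracy=acc)
        print(acc, {k: round(v, 3) for k, v in r.items()})
```

Under these assumptions, the codebook concept has prevalence 0.30 + 0.70 × 0.20 = 0.44. Raising `llm_accuracy` to 1.0 and applying the correction both move the estimate toward 0.44, not toward the intended 0.30, because both the LLM and the gold labels inherit the codebook's definition.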
