Morgan Wack

25.2 HC Mar 11

The Laziness of the Crowd: Effort Aversion Among Raters Risks Undermining the Efficacy of X's Community Notes Program

Morgan Wack, Patrick Warren, Mustafa Alam

Crowdsourced moderation systems like Twitter/X's Community Notes program have been proposed as scalable alternatives to professional fact-checkers for combating online misinformation. While prior research has examined the effectiveness of such systems in reducing engagement with false content and their vulnerability to partisan bias, we identify a previously untested mechanism linking fact-check difficulty to systematic non-participation by crowdsourced raters. We hypothesize that claims requiring less cognitive effort to evaluate (specifically, those that are obviously false and easy to refute) are more likely to receive public notes than claims that are more plausible and require greater effort to debunk. Using eighteen months of vaccine-related Community Notes data (2,250 posts) and ratings from 382 survey participants, we show that claims perceived as more difficult to fact-check are significantly less likely to receive notes that achieve "helpful"/public status. After conducting additional analyses and running an LLM-based fact-checking pipeline to help rule out alternative explanations, we interpret this pattern as consistent with an unwillingness among raters to invest the mental effort required to evaluate and rate notes for more plausible misinformation. These findings suggest that crowdsourced moderation may systematically fail to address the forms of plausible misinformation that are most likely to deceive. We discuss implications for platform design and propose mechanisms to mitigate this difficulty penalty in crowdsourced content moderation systems.

92.7 CL Apr 4

Researchers waste 80% of LLM annotation costs by classifying one text at a time

Christian Pipal, Eva-Maria Vogel, Morgan Wack et al.

Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length. Within this safe operating range, the measurement error from batching and stacking is smaller than typical inter-coder disagreement in the ground-truth data.
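The call-count arithmetic in this abstract can be checked with a minimal sketch. The function names `api_calls` and `make_batches` are hypothetical and are not taken from the paper. The sketch counts prompts only and does not model token usage, so it reproduces the 400,000 and 4,000 call figures but not the reported cost saving of over 80%.

```python
from math import ceil


def api_calls(n_texts: int, n_variables: int,
              batch_size: int = 1, stacked: bool = False) -> int:
    """Count the prompts needed to code n_texts on n_variables.

    batch_size: number of texts placed in one prompt.
    stacked:    if True, all variables share one prompt per batch;
                otherwise each variable needs its own prompt.
    """
    if min(n_texts, n_variables, batch_size) < 1:
        raise ValueError("all counts must be positive")
    n_batches = ceil(n_texts / batch_size)
    prompts_per_batch = 1 if stacked else n_variables
    return n_batches * prompts_per_batch


def make_batches(texts: list, batch_size: int) -> list:
    """Split texts into consecutive batches of at most batch_size items."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


# Figures from the abstract: 100,000 texts, four variables.
baseline = api_calls(100_000, 4)                              # one text, one variable per prompt
batched = api_calls(100_000, 4, batch_size=25, stacked=True)  # 25 texts, all variables per prompt
print(baseline, batched)  # 400000 4000
```

The number of calls falls by 99% in this example, while the abstract reports the token cost saving as over 80%. The two figures measure different things, and only the first is computed here.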
