CL · Jun 3

Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection

arXiv:2606.04924 · 22.7
Predicted impact top 45% in CL · last 90 days · Originality: Synthesis-oriented
AI Analysis

For NLP researchers relying on crowdsourced data, this paper identifies the scale and challenges of LLM contamination, but is primarily a survey of opinions rather than a solution.

A survey of 155 NLP researchers found that 44% had observed LLM usage in their crowdsourced data. Most had anticipated the problem, but about half were unsure what precautions to take, which suggests that current mitigation measures are insufficient.

The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.

