AI, CY | Jun 17, 2024

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways

arXiv:2406.11980v1 | 27 citations
Originality: Incremental advance
AI Analysis

This work provides a warning and practical guide for researchers and practitioners using LLMs for annotation in computational social science, highlighting the unpredictable impact of prompt design on data quality.

The study investigated how prompt design affects LLMs' compliance and accuracy when annotating computational social science tasks. It found that prompting for numerical scores instead of labels reduces compliance and accuracy for all tested LLMs, that the best prompting setup is task-dependent, and that minor prompt changes can cause large shifts in the distribution of generated labels.

Manually annotating data for computational social science tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (ChatGPT, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are highly prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. The overall best prompting setup is task-dependent, and minor prompt changes can cause large changes in the distribution of generated labels. By showing that prompt design significantly impacts the quality and distribution of LLM-generated annotations, this work serves as both a warning and practical guide for researchers and practitioners.
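The multi-prompt design described in the abstract can be pictured as a full factorial grid over the four prompt features, with each prompt variant scored for compliance and accuracy. The following is a minimal sketch only, not the authors' code: the factor levels, the function names `prompt_variants` and `compliance_and_accuracy`, and the exact metric definitions (compliance as the share of outputs that are a valid label, accuracy computed over compliant outputs only) are assumptions made for illustration.

```python
from itertools import product

# Hypothetical factor levels; the paper's actual levels and wording may differ.
FACTORS = {
    "definition": [False, True],        # include a task definition or not
    "output_type": ["label", "score"],  # ask for a label or a numerical score
    "explanation": [False, True],       # ask the model to explain its answer
    "concise": [False, True],           # shorter vs. longer prompt
}


def prompt_variants(factors=FACTORS):
    """Return one dict per combination of factor levels (full factorial grid)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def compliance_and_accuracy(outputs, gold, valid_labels):
    """Score one prompt variant under assumed metric definitions.

    Compliance: share of outputs that, after normalization, are a valid label.
    Accuracy: share of compliant outputs that match the gold label.
    """
    normalized = [o.strip().lower() for o in outputs]
    compliant = [(o, g) for o, g in zip(normalized, gold) if o in valid_labels]
    compliance = len(compliant) / len(outputs) if outputs else 0.0
    accuracy = (
        sum(o == g for o, g in compliant) / len(compliant) if compliant else 0.0
    )
    return compliance, accuracy


if __name__ == "__main__":
    print(len(prompt_variants()), "prompt variants")
    # Made-up model outputs for illustration; not data from the paper.
    outputs = ["toxic", "Not Toxic ", "I cannot answer", "toxic"]
    gold = ["toxic", "not toxic", "toxic", "not toxic"]
    print(compliance_and_accuracy(outputs, gold, {"toxic", "not toxic"}))
```

Reporting the two metrics separately makes it possible to tell a prompt that often yields unusable output apart from one that yields usable but wrong labels.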
