AI CY
Jun 17, 2024

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways

Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, Libby Hemphill

arXiv:2406.11980v1
15.427 citations

Originality: Incremental advance

AI Analysis

This work provides a warning and practical guide for researchers and practitioners using LLMs for annotation in computational social science, highlighting the unpredictable impact of prompt design on data quality.

The study investigated how prompt design affects LLMs' compliance and accuracy when annotating computational social science tasks. It found that prompting for numerical scores instead of labels reduces both compliance and accuracy, that the best prompting setup is task-dependent, and that minor prompt changes can cause large shifts in the distribution of generated labels.

Manually annotating data for computational social science tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (ChatGPT, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are highly prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. The overall best prompting setup is task-dependent, and minor prompt changes can cause large changes in the distribution of generated labels. By showing that prompt design significantly impacts the quality and distribution of LLM-generated annotations, this work serves as both a warning and practical guide for researchers and practitioners.
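To make the experimental design concrete, the following is a minimal Python sketch of how a multi-prompt grid over the four design features might be enumerated and how compliance and accuracy might be scored. It is an illustration under stated assumptions, not the authors' code: the feature levels, the function names (`prompt_variants`, `is_compliant`, `compliance_and_accuracy`), the exact-match compliance rule, and the choice to compute accuracy over compliant responses only are all hypothetical.

```python
from itertools import product

# Hypothetical feature levels (assumption); the paper's actual prompt wording
# and level names are not reproduced here.
FEATURES = {
    "definition": [False, True],       # include a definition of the concept?
    "output_type": ["label", "score"],  # ask for a label or a numerical score
    "explanation": [False, True],      # ask the model to explain its answer?
    "length": ["concise", "detailed"],  # short or long prompt
}


def prompt_variants():
    """Yield one dict per combination of prompt design features."""
    names = list(FEATURES)
    for values in product(*(FEATURES[name] for name in names)):
        yield dict(zip(names, values))


def is_compliant(response, allowed):
    """Assumed rule: a response complies if, once trimmed and lowercased,
    it exactly matches one of the allowed outputs."""
    return response.strip().lower() in allowed


def compliance_and_accuracy(responses, gold, allowed):
    """Return (compliance rate, accuracy among compliant responses)."""
    compliant = [
        (r.strip().lower(), g)
        for r, g in zip(responses, gold)
        if is_compliant(r, allowed)
    ]
    compliance = len(compliant) / len(responses) if responses else 0.0
    accuracy = (
        sum(r == g for r, g in compliant) / len(compliant) if compliant else 0.0
    )
    return compliance, accuracy
```

With two levels per feature, `prompt_variants()` yields 16 configurations per model and task. Scoring accuracy only on compliant responses keeps the two measures separate, so a prompt that produces many off-format answers is penalized on compliance rather than silently on accuracy; other conventions (for example, counting non-compliant responses as errors) are equally defensible.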
