AIJul 11, 2024
Vox Populi, Vox AI? Using Language Models to Estimate German Public OpinionLeah von der Heyde, Anna-Carolina Haensch, Alexander Wenz
The recent development of large language models (LLMs) has spurred discussions about whether LLM-generated "synthetic samples" could complement or replace traditional surveys, considering their training data potentially reflects attitudes and behaviors prevalent in the population. A number of mostly US-based studies have prompted LLMs to mimic survey respondents, with some of them finding that the responses closely match the survey data. However, several contextual factors related to the relationship between the respective target population and LLM training data might affect the generalizability of such findings. In this study, we investigate the extent to which LLMs can estimate public opinion in Germany, using the example of vote choice. We generate a synthetic sample of personas matching the individual characteristics of the 2017 German Longitudinal Election Study respondents. We ask the LLM GPT-3.5 to predict each respondent's vote choice and compare these predictions to the survey-based estimates on the aggregate and subgroup levels. We find that GPT-3.5 does not predict citizens' vote choice accurately, exhibiting a bias towards the Green and Left parties. While the LLM captures the tendencies of "typical" voter subgroups, such as partisans, it misses the multifaceted factors swaying individual voter choices. By examining the LLM-based prediction of voting behavior in a new context, our study contributes to the growing body of research about the conditions under which LLMs can be leveraged for studying public opinion. The findings point to disparities in opinion representation in LLMs and underscore the limitations in applying them for public opinion estimation.
CYAug 29, 2024
United in Diversity? Contextual Biases in LLM-Based Predictions of the 2024 European Parliament ElectionsLeah von der Heyde, Anna-Carolina Haensch, Alexander Wenz et al.
"Synthetic samples" based on large language models (LLMs) have been argued to serve as efficient alternatives to surveys of humans, assuming that their training data includes information on human attitudes and behavior. However, LLM-synthetic samples might exhibit bias, for example due to training data and fine-tuning processes being unrepresentative of diverse contexts. Such biases risk reinforcing existing biases in research, policymaking, and society. Therefore, researchers need to investigate if and under which conditions LLM-generated synthetic samples can be used for public opinion prediction. In this study, we examine to what extent LLM-based predictions of individual public opinion exhibit context-dependent biases by predicting the results of the 2024 European Parliament elections. Prompting three LLMs with individual-level background information of 26,000 eligible European voters, we ask the LLMs to predict each person's voting behavior. By comparing them to the actual results, we show that LLM-based predictions of future voting behavior largely fail, their accuracy is unequally distributed across national and linguistic contexts, and they require detailed attitudinal information in the prompt. The findings emphasize the limited applicability of LLM-synthetic samples to public opinion prediction. In investigating their contextual biases, this study contributes to the understanding and mitigation of inequalities in the development of LLMs and their applications in computational social science.
CLJun 17, 2025
AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey MotivationLeah von der Heyde, Anna-Carolina Haensch, Bernd Weiß et al.
The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.