Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
This work argues for principled evaluation of LLM-human alignment in social science applications and highlights the risks of naively simulating human subjects with LLMs, particularly for researchers in fields such as economics and marketing.
The authors assess how well large language models replicate human behavior in social science research by developing a hypothesis-testing framework that quantifies misalignment in multiple-choice settings. Applying it to a popular model, they find the model ill-suited for simulating the opinions of the tested sub-populations (e.g., across different races, ages, and incomes) on contentious questions, which calls into question its alignment with those populations.
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We apply this framework to a popular language model for simulating people's opinions in various public surveys and find that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the alignment of this language model with the tested populations, highlighting the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
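To make the idea of testing misalignment on a multiple-choice question concrete, the following is a minimal sketch and not the authors' procedure, since the abstract does not specify which test statistic the framework uses. It assumes a two-sample setting in which the null hypothesis is that human and LLM-simulated answers come from the same distribution over the options, and it uses a permutation test on a Pearson chi-square statistic. The function names (`chi2_homogeneity`, `permutation_test`) and the response counts are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def chi2_homogeneity(sample_a, sample_b, options):
    """Pearson chi-square statistic comparing two samples of chosen options."""
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    stat = 0.0
    for opt in options:
        total = count_a[opt] + count_b[opt]
        if total == 0:
            continue  # option never chosen in either sample; contributes nothing
        for observed, n in ((count_a[opt], n_a), (count_b[opt], n_b)):
            expected = total * n / (n_a + n_b)
            stat += (observed - expected) ** 2 / expected
    return stat


def permutation_test(human, llm, options, n_perm=2000, seed=0):
    """Return (statistic, p-value) under H0: both samples share one distribution.

    Assumption for illustration: responses are exchangeable under H0, so
    randomly reassigning the 'human' and 'LLM' labels gives the null distribution.
    """
    rng = random.Random(seed)
    observed = chi2_homogeneity(human, llm, options)
    pooled = list(human) + list(llm)
    n_h = len(human)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = chi2_homogeneity(pooled[:n_h], pooled[n_h:], options)
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    options = ["A", "B", "C"]
    # Hypothetical answers for one survey question and one sub-population.
    human_answers = ["A"] * 40 + ["B"] * 35 + ["C"] * 25
    llm_answers = ["A"] * 85 + ["B"] * 10 + ["C"] * 5
    stat, p_value = permutation_test(human_answers, llm_answers, options)
    print(f"chi-square statistic = {stat:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value here would be evidence against the hypothesis that the simulated answers match the human ones for that question and sub-population. Repeating such a test across many questions and sub-populations would additionally call for a multiple-comparison correction, which this sketch omits.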