CL AI CY HC, May 12, 2024

Limited Ability of LLMs to Simulate Human Psychological Behaviours: a Psychometric Analysis

arXiv:2405.07248v1, 41 citations, h-index: 8
Originality: Incremental advance
AI Analysis

This work examines whether social scientists and AI researchers can use LLMs as proxies for human participants in experiments, and it reveals significant limitations in simulating individual-level human behavior.

The study investigated whether large language models (LLMs) like GPT-3.5 and GPT-4 can simulate human psychological behaviors by responding to standardized personality questionnaires, finding that GPT-4 with generic personas showed promising but imperfect psychometric properties, while both models performed poorly with specific demographic profiles.

The humanlike responses of large language models (LLMs) have prompted social scientists to investigate whether LLMs can be used to simulate human participants in experiments, opinion polls and surveys. Of central interest in this line of research has been mapping out the psychological profiles of LLMs by prompting them to respond to standardized questionnaires. The conflicting findings of this research are unsurprising given that mapping out underlying, or latent, traits from LLMs' text responses to questionnaires is no easy task. To address this, we use psychometrics, the science of psychological measurement. In this study, we prompt OpenAI's flagship models, GPT-3.5 and GPT-4, to assume different personas and respond to a range of standardized measures of personality constructs. We used two kinds of persona descriptions: either generic (four or five random person descriptions) or specific (mostly demographics of actual humans from a large-scale human dataset). We found that the responses from GPT-4, but not GPT-3.5, using generic persona descriptions show promising, albeit not perfect, psychometric properties, similar to human norms, but the data from both LLMs, when using specific demographic profiles, show poor psychometric properties. We conclude that, currently, when LLMs are asked to simulate silicon personas, their responses are poor signals of potentially underlying latent traits. Thus, our work casts doubt on LLMs' ability to simulate individual-level human behaviour across multiple-choice question answering tasks.
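
To make "psychometric properties" concrete, here is a minimal sketch of one common reliability statistic, Cronbach's alpha, computed over questionnaire responses. The abstract does not say which statistics the authors used, so the choice of alpha, the function name `cronbach_alpha`, and the toy scores are all assumptions for illustration only, not the paper's method or data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one scale.

    responses: list of rows, one row per simulated persona (or human),
    each row holding that respondent's numeric scores on the scale's items.
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")

    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(n_items)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert scores: 4 personas x 3 items (made-up data).
toy_scores = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
alpha = cronbach_alpha(toy_scores)
```

Values of alpha near 1 suggest that the items move together, as expected if they reflect one latent trait. Low or negative values suggest the responses are a weak signal of any underlying trait, which is the kind of pattern the abstract describes for the specific demographic personas.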

Code Implementations: 1 repo
