AI · Aug 9, 2025

Large Language Models Do Not Simulate Human Psychology

Sarah Schröder, Thekla Morgenroth, Ulrike Kuhl, Valerie Vaquet, Benjamin Paaßen

arXiv:2508.06950v320.818 citations · h-index: 20

Originality: Synthesis-oriented

AI Analysis

This work cautions psychological researchers that LLMs are unreliable at simulating human behavior. It is an incremental contribution that refines existing cautionary perspectives.

The paper argues that large language models (LLMs) do not simulate human psychology and cautions against using them as replacements for human participants in psychological studies. Its empirical evidence shows that LLM responses diverge from human responses when item wording changes, and that different models respond very differently to novel items.

Large Language Models (LLMs), such as ChatGPT, are increasingly used in research, ranging from simple writing assistance to complex data annotation tasks. Recently, some research has suggested that LLMs may even be able to simulate human psychology and can, hence, replace human participants in psychological studies. We caution against this approach. We provide conceptual arguments against the hypothesis that LLMs simulate human psychology. We then present empirical evidence illustrating our arguments by demonstrating that slight changes to wording that correspond to large changes in meaning lead to notable discrepancies between LLMs' and human responses, even for the recent CENTAUR model that was specifically fine-tuned on psychological responses. Additionally, different LLMs show very different responses to novel items, further illustrating their lack of reliability. We conclude that LLMs do not simulate human psychology and recommend that psychological researchers treat LLMs as useful but fundamentally unreliable tools that need to be validated against human responses for every new application.
