Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
This work helps researchers assess cognitive capabilities in AI systems. Its contribution is incremental: it extends existing evaluation methods with a new framework.
The researchers evaluate theory of mind and world modeling in large language models using StorySim, a framework for generating synthetic stories. They find that models perform better on world modeling tasks than on theory of mind tasks, and that models show evidence of heuristic behaviors such as recency bias.
We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than on ToM tasks, and that models tend to reason better about humans than about inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.
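To make the distinction between WM and ToM questions concrete, here is a minimal Python sketch of a storyboard-driven story generator. It is not the $\texttt{StorySim}$ API: the names Event, render_story, world_state, first_order_belief, and second_order_belief are hypothetical and chosen only for illustration. The sketch also assumes a simplified belief model in which a character updates a belief only from events it witnesses, and in which a character knows who else was present at any event it witnessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical storyboard step: an actor moves an object."""
    actor: str
    obj: str
    location: str
    witnesses: frozenset  # characters present for this event (actor included)


def render_story(events):
    """Turn storyboard events into a plain-text story prompt."""
    return " ".join(
        f"With {', '.join(sorted(e.witnesses))} present, "
        f"{e.actor} moves the {e.obj} to the {e.location}."
        for e in events
    )


def world_state(events, obj):
    """WM answer: where the object actually is after all events."""
    location = None
    for e in events:
        if e.obj == obj:
            location = e.location
    return location


def first_order_belief(events, obj, character):
    """First-order ToM answer: where `character` believes the object is,
    taken as its location after the last event that character witnessed."""
    belief = None
    for e in events:
        if e.obj == obj and character in e.witnesses:
            belief = e.location
    return belief


def second_order_belief(events, obj, thinker, target):
    """Second-order ToM answer: where `thinker` believes `target` believes
    the object is. Simplifying assumption: this is the location after the
    last event that both characters witnessed together."""
    belief = None
    for e in events:
        if e.obj == obj and {thinker, target} <= e.witnesses:
            belief = e.location
    return belief


# A two-event storyboard: Anna is absent when Ben moves the ball.
events = [
    Event("Anna", "ball", "basket", frozenset({"Anna", "Ben"})),
    Event("Ben", "ball", "box", frozenset({"Ben"})),
]

story = render_story(events)
wm_answer = world_state(events, "ball")
tom1_answer = first_order_belief(events, "ball", "Anna")
tom2_answer = second_order_belief(events, "ball", "Ben", "Anna")
```

In this storyboard the WM answer is the box, while the first-order answer for Anna is the basket, because she did not witness the second move. Because the story is rendered from structured events, the same storyboard can be re-rendered with different character names or event orderings, which is the kind of control that lets one probe heuristics such as recency bias.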