XToM: Exploring the Multilingual Theory of Mind for Large Language Models
The paper addresses the gap in multilingual ToM evaluation for LLMs. The contribution is incremental: it extends existing English-focused benchmarks to diverse linguistic contexts.
The paper tackles the problem of evaluating Theory of Mind (ToM) in large language models (LLMs) across multiple languages. It finds that although models excel at multilingual language understanding, their ToM performance varies significantly across languages, which exposes limitations in replicating human-like mentalizing.
Theory of Mind (ToM), the ability to infer mental states in others, is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind, which is the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
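The cross-language comparison described above can be illustrated with a minimal sketch that computes per-language accuracy and the spread between the best and worst language. Everything here is an assumption made for illustration: the record format, the function names `accuracy_by_language` and `cross_lingual_gap`, the placeholder language labels, and the toy data are hypothetical and are not taken from XToM or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return accuracy per language.

    `records` is an iterable of (language, is_correct) pairs; this format
    is a hypothetical stand-in for whatever the benchmark actually logs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_correct in records:
        totals[language] += 1
        correct[language] += int(is_correct)
    return {language: correct[language] / totals[language] for language in totals}


def cross_lingual_gap(accuracies):
    """Difference between the highest and lowest per-language accuracy."""
    return max(accuracies.values()) - min(accuracies.values())


# Made-up toy records with placeholder language labels (not real results).
toy_records = [
    ("lang_a", True),
    ("lang_a", True),
    ("lang_b", True),
    ("lang_b", False),
]
toy_accuracy = accuracy_by_language(toy_records)
toy_gap = cross_lingual_gap(toy_accuracy)
print(toy_accuracy, toy_gap)
```

A large gap on ToM questions alongside a small gap on plain language-understanding questions would correspond to the kind of dissonance the abstract reports; the sketch only shows how such a gap could be measured, not what the paper measured.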