CL · Jun 17, 2025

MIST: Towards Multi-dimensional Implicit BiaS Evaluation of LLMs via Theory of Mind

arXiv:2506.14161v2 · h-index: 6
Originality: Incremental advance
AI Analysis

The framework provides a more robust methodology for identifying structural implicit bias in LLMs, addressing a domain-specific problem in AI ethics and evaluation.

The paper tackles the challenge of evaluating implicit bias in Large Language Models (LLMs), which it treats as a failure of Theory of Mind. It proposes a multi-dimensional framework based on the Stereotype Content Model; experiments across 8 state-of-the-art models reveal pervasive sociability bias and asymmetric stereotype amplification.

Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 state-of-the-art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.
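
To make the idea of a per-dimension bias score concrete, here is a minimal sketch of how word-association choices could be aggregated along the three SCM dimensions. The abstract does not give the scoring rule, so everything here is an assumption for illustration: the `dimension_bias` function, the `(dimension, group, valence)` record format, the +1/-1 valence coding, and the toy group names are hypothetical and are not the paper's metric or data.

```python
from collections import defaultdict

# The three SCM dimensions named in the abstract.
DIMENSIONS = ("Competence", "Sociability", "Morality")


def dimension_bias(records, group_a, group_b):
    """Per dimension, return mean valence for group_a minus mean valence for group_b.

    records: iterable of (dimension, group, valence) tuples, where valence is
    +1 if the model paired a positive word with the group and -1 if it paired
    a negative word. This scoring rule is a hypothetical stand-in, not the
    metric used in the paper.
    """
    totals = defaultdict(lambda: [0, 0])  # (dimension, group) -> [sum, count]
    for dimension, group, valence in records:
        entry = totals[(dimension, group)]
        entry[0] += valence
        entry[1] += 1

    scores = {}
    for dimension in DIMENSIONS:
        sum_a, n_a = totals[(dimension, group_a)]
        sum_b, n_b = totals[(dimension, group_b)]
        if n_a == 0 or n_b == 0:
            continue  # no responses for one of the groups; skip this dimension
        scores[dimension] = sum_a / n_a - sum_b / n_b
    return scores


# Toy, made-up responses: the two groups differ only on Sociability.
toy_records = [
    ("Competence", "group_x", +1), ("Competence", "group_y", +1),
    ("Sociability", "group_x", +1), ("Sociability", "group_y", -1),
    ("Morality", "group_x", -1), ("Morality", "group_y", -1),
]
toy_scores = dimension_bias(toy_records, "group_x", "group_y")
print(toy_scores)
```

Keeping one score per dimension, rather than a single aggregate, is what lets a pattern such as a gap on Sociability with no gap on Competence or Morality show up at all.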

