CL · AI · Nov 28, 2025

Mind Reading or Misreading? LLMs on the Big Five Personality Test

arXiv:2511.23101v1 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the unreliability of LLM-based personality prediction, a concern for both researchers and practitioners, and contributes incremental improvements in prompt design and evaluation.

The study evaluated large language models (LLMs) for automatic personality prediction from text under the Big Five model. Enriched prompts reduced invalid outputs and improved class balance but introduced a systematic bias toward predicting trait presence. Performance varied by trait, and no configuration achieved consistently reliable zero-shot predictions.

We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). Five models, including GPT-4 and lightweight open-source alternatives, are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially across traits: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
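The point about aggregate metrics masking asymmetries can be illustrated with a minimal sketch. The labels and predictions below are hypothetical toy values, not results from the paper, and the helper names (`accuracy`, `macro_f1`, `per_class_recall`) are illustrative assumptions rather than the authors' code. The toy classifier leans toward predicting trait presence (label 1), the kind of bias the abstract attributes to enriched prompts.

```python
# Toy illustration (hypothetical data): a classifier biased toward
# predicting trait presence (1) on a balanced binary trait label.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred, labels=(0, 1)):
    """Recall computed separately for each class."""
    recalls = {}
    for label in labels:
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / support if support else 0.0
    return recalls


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels: 5 texts with the trait present, 5 with it absent.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical predictions: every "present" text is caught, but three of the
# five "absent" texts are also labelled present.
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("macro-F1:", round(macro_f1(y_true, y_pred), 3))
print("per-class recall:", per_class_recall(y_true, y_pred))
```

In this toy case accuracy and macro-F1 both land near 0.7, which looks moderate, while the per-class recall shows that the "absent" class is recovered far less often than the "present" class. That gap is the asymmetry the aggregate scores hide.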

