CL Jul 18, 2025

Can LLMs Infer Personality from Real World Conversations?

Jianfeng Zhu, Ruoming Jin, Karin G. Coifman

arXiv:2507.14355v18.32 citations h-index: 4

Originality Synthesis-oriented

AI Analysis

This work addresses scalable personality assessment for psychological applications. Its contribution is incremental: it documents the limitations of current LLMs rather than introducing a method that overcomes them.

The researchers tested whether Large Language Models (LLMs) can infer personality traits from real-world conversations. The models showed high test-retest reliability but limited construct validity: correlations with ground-truth scores were weak (max Pearson's r = 0.27) and interrater agreement was low (Cohen's κ < 0.10).

Large Language Models (LLMs) such as OpenAI's GPT-4 and Meta's LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson's $r = 0.27$), interrater agreement was low (Cohen's $\kappa < 0.10$), and predictions were biased toward moderate or high trait levels. Chain-of-thought prompting and longer input context modestly improved distributional alignment, but not trait-level accuracy. These results underscore limitations in current LLM-based personality inference and highlight the need for evidence-based development for psychological applications.
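
The two validity metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The scores below are made-up illustration data, and the `to_level` thresholds for low, moderate, and high are an assumption chosen only to show the mechanics. Pearson's r compares continuous trait scores, while Cohen's κ compares categorical trait levels after correcting for chance agreement.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: product of the raters' marginal proportions per label.
    expected = sum(
        counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def to_level(score):
    """Bin a 1-5 trait score into a level (hypothetical thresholds)."""
    if score < 2.5:
        return "low"
    if score <= 3.5:
        return "moderate"
    return "high"


# Hypothetical data: self-reported vs. LLM-predicted scores for one trait.
self_report = [1.5, 2.0, 3.0, 3.5, 4.0, 4.5, 2.5, 5.0]
llm_predicted = [3.5, 4.0, 3.5, 4.0, 3.5, 4.5, 4.0, 4.0]

r = pearson_r(self_report, llm_predicted)
kappa = cohen_kappa(
    [to_level(s) for s in self_report],
    [to_level(s) for s in llm_predicted],
)
print(f"Pearson r = {r:.2f}, Cohen kappa = {kappa:.2f}")
```

In this toy data the predictions cluster at moderate or high levels, the kind of bias the abstract describes, which drags κ toward zero even when individual scores are not far off.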
