CL · Jul 18, 2025

Can LLMs Infer Personality from Real World Conversations?

arXiv:2507.14355v1 · 2 citations · h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of scalable personality assessment for psychological applications. Its contribution is incremental: it documents the limitations of current LLM-based inference rather than delivering a breakthrough.

The researchers tested whether Large Language Models (LLMs) can infer personality traits from real-world conversations. The models showed high test-retest reliability, but their construct validity was limited: correlations with ground-truth scores were weak (max Pearson's r = 0.27) and interrater agreement was low (Cohen's κ < 0.10).

Large Language Models (LLMs) such as OpenAI's GPT-4 and Meta's LLaMA offer a promising approach for scalable personality assessment from open-ended language. However, inferring personality traits remains challenging, and earlier work often relied on synthetic data or social media text lacking psychometric validity. We introduce a real-world benchmark of 555 semi-structured interviews with BFI-10 self-report scores for evaluating LLM-based personality inference. Three state-of-the-art LLMs (GPT-4.1 Mini, Meta-LLaMA, and DeepSeek) were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited: correlations with ground-truth scores were weak (max Pearson's $r = 0.27$), interrater agreement was low (Cohen's $\kappa < 0.10$), and predictions were biased toward moderate or high trait levels. Chain-of-thought prompting and longer input context modestly improved distributional alignment, but not trait-level accuracy. These results underscore limitations in current LLM-based personality inference and highlight the need for evidence-based development for psychological applications.
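The two validity metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the example scores, and the low/moderate/high cut-offs in `to_level` are all hypothetical, chosen only to show how Pearson's r (agreement on continuous trait scores) and Cohen's κ (chance-corrected agreement on categorical trait levels) are computed.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


def to_level(score, low=2.5, high=3.5):
    """Bin a 1-5 trait score into a level (cut-offs are an assumption)."""
    if score < low:
        return "low"
    if score > high:
        return "high"
    return "moderate"


# Hypothetical self-report and model-predicted scores for one trait.
self_report = [1.5, 2.0, 3.0, 4.5, 5.0, 2.5]
predicted = [3.0, 3.5, 3.0, 4.0, 4.0, 3.5]

r = pearson_r(self_report, predicted)
kappa = cohen_kappa([to_level(s) for s in self_report],
                    [to_level(s) for s in predicted])
print(f"Pearson r = {r:.2f}, Cohen's kappa = {kappa:.2f}")
```

A model that always predicts moderate or high levels, as the abstract reports, can still correlate weakly with the self-reports while agreeing with them on levels barely better than chance, which is why the two metrics are reported together.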

