AI · Nov 25, 2025

Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

Daniel I Jackson, Emma L Jensen, Syed-Amad Hussain, Emre Sezgin

arXiv:2511.19872v2 · 1 citation

Originality: Synthesis-oriented

AI Analysis

This work addresses how reliably LLMs assess their own abilities, a question relevant to AI researchers. It is incremental: it applies an existing psychometric instrument to LLMs, and the resulting self-assessments are not calibrated to actual performance.

The study adapted a psychometric self-efficacy scale to elicit simulated self-assessments from ten large language models across four conditions. Self-assessment scores were stable across repeated administrations but did not reliably reflect actual task performance. Aggregate self-efficacy was lower than human norms, and summarization accuracy varied widely across models.

Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
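The administration procedure described in the abstract (repeated runs of a 10-item scale with randomized item order) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the standard GSES response format of 1 to 4 per item (total 10 to 40), which the adapted version used in the paper may not follow exactly. The function names `shuffled_order`, `score_administration`, and `stability` are hypothetical, and the step of actually querying a model is left out.

```python
import random
import statistics

GSES_ITEMS = 10               # the GSES has 10 items
SCALE_MIN, SCALE_MAX = 1, 4   # ASSUMPTION: standard 4-point GSES format


def shuffled_order(seed):
    """Return a randomized presentation order over item indices 0..9."""
    order = list(range(GSES_ITEMS))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    return order


def score_administration(order, responses):
    """Map responses (given in presented order) back to canonical item
    order and return (per_item_scores, total_score)."""
    if len(order) != GSES_ITEMS or len(responses) != GSES_ITEMS:
        raise ValueError("expected exactly 10 items and 10 responses")
    per_item = [0] * GSES_ITEMS
    for item_idx, rating in zip(order, responses):
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating} outside {SCALE_MIN}-{SCALE_MAX}")
        per_item[item_idx] = rating
    return per_item, sum(per_item)


def stability(totals):
    """Summarize repeated administrations as (mean, population SD).
    A small SD indicates stable totals across runs."""
    return statistics.mean(totals), statistics.pstdev(totals)
```

In this sketch, stability is summarized as the spread of total scores across runs; the paper may use a different reliability statistic. Mapping responses back to canonical item order before scoring keeps per-item comparisons valid even when each run presents the items in a different sequence.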
