AI · Nov 25, 2025

Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

arXiv:2511.19872v2 · 1 citation
Originality: Synthesis-oriented
AI Analysis

This work addresses how reliably LLMs assess their own abilities, a question relevant to researchers evaluating AI systems. It is incremental: it applies an existing psychometric method to LLMs and does not yield calibrated performance estimates.

The study adapted a psychometric self-efficacy scale to elicit simulated self-assessments from ten large language models across different task conditions. Self-assessment scores were stable but did not reliably reflect actual task performance: aggregate self-efficacy fell below human norms, and summarization performance varied widely across models.

Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
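
As a rough illustration of the administration procedure described above (repeated runs with randomized item order), the following Python sketch scores a 10-item scale and checks stability across runs. It assumes the standard GSES scoring of ten items rated 1 to 4 and summed to a total between 10 and 40; whether the paper scored responses exactly this way is not stated here. The names `administer`, `stability`, and `fake_model` are hypothetical, and `fake_model` is a stand-in for a real LLM call, not the authors' code.

```python
import random
import statistics

N_ITEMS = 10        # the GSES has 10 items
MIN_R, MAX_R = 1, 4  # assumed standard 1-4 response scale


def administer(ask, rng):
    """Present all items in a random order and return the summed score.

    `ask` is a hypothetical callable: item index -> integer rating.
    """
    order = list(range(N_ITEMS))
    rng.shuffle(order)  # randomize item order for this administration
    total = 0
    for item in order:
        rating = ask(item)
        if not MIN_R <= rating <= MAX_R:
            raise ValueError(f"rating {rating} outside {MIN_R}-{MAX_R}")
        total += rating
    return total


def stability(ask, runs=5, seed=0):
    """Repeat the administration and return (mean, population std dev)."""
    rng = random.Random(seed)
    totals = [administer(ask, rng) for _ in range(runs)]
    return statistics.mean(totals), statistics.pstdev(totals)


def fake_model(item):
    """Stand-in respondent (assumption): answers depend only on the item."""
    return 3 if item % 2 == 0 else 2


if __name__ == "__main__":
    mean, sd = stability(fake_model)
    print(f"mean total = {mean}, std dev = {sd}")
```

A standard deviation near zero across runs would correspond to the stability the abstract reports; it says nothing about whether the score tracks task performance, which is the paper's central caveat.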

