CL · Jun 25, 2024

Evaluating Large Language Models with Psychometrics

arXiv:2406.17675v2 · 27 citations
AI Analysis

This work addresses the need for reliable evaluation of LLMs' psychological behaviors, which is crucial for AI and social sciences applications, though it is incremental in applying psychometric methods to AI.

The paper tackles the problem of evaluating whether large language models exhibit consistent psychological patterns by developing a comprehensive benchmark that quantifies five key psychological constructs across 13 datasets. The results reveal significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, and show that some preference-based tests designed for humans fail to elicit reliable responses from LLMs.

Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in whether they exhibit psychological patterns, and whether these patterns remain consistent across different contexts -- questions that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a comprehensive benchmark for quantifying psychological constructs of LLMs, encompassing psychological dimension identification, assessment dataset design, and assessment with results validation. Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets featuring diverse scenarios and item types. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors. Our findings also show that some preference-based tests, originally designed for humans, could not solicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
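
To make the idea of a gap between self-reported traits and scenario-based responses concrete, here is a minimal Python sketch. It is not the paper's scoring procedure: the function names, the 1-5 Likert scale, the reverse-keyed item indices, and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def likert_trait_score(responses, reverse_keyed, scale_max=5):
    """Average Likert responses for one trait, flipping reverse-keyed items.

    `responses` are integers on a 1..scale_max scale (assumed format);
    `reverse_keyed` is the set of item indices whose wording is negated.
    """
    adjusted = [
        (scale_max + 1 - r) if i in reverse_keyed else r
        for i, r in enumerate(responses)
    ]
    return mean(adjusted)


def discrepancy(self_report_score, scenario_score):
    """Absolute gap between two scores expressed on the same scale."""
    return abs(self_report_score - scenario_score)


# Hypothetical self-report answers for one trait; items 1 and 3 are reverse-keyed.
self_report = likert_trait_score([5, 1, 4, 2], reverse_keyed={1, 3})

# Hypothetical ratings (same 1-5 scale) of the model's answers to scenario items.
scenario = mean([2, 3, 2, 3])

gap = discrepancy(self_report, scenario)
print(f"self-report={self_report}, scenario={scenario}, gap={gap}")
```

A large gap under this toy metric would correspond to the kind of mismatch the abstract describes; a real analysis would need validated items and a principled way to place both scores on a common scale.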

