CL · LG · Oct 8, 2025

Quantifying Data Contamination in Psychometric Evaluations of LLMs

arXiv:2510.07175v1 · 4 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses a reliability issue for researchers and practitioners who administer psychometric tests to LLMs. The advance is incremental, since it quantifies a concern that prior work had already raised.

The paper tackles data contamination in psychometric evaluations of LLMs by proposing a framework to measure it. It finds strong contamination in popular inventories such as BFI-44 and PVQ-40: models not only memorize the items but can also adjust their responses to achieve specific target scores.

Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
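As a rough illustration of the third aspect, target score matching, the sketch below scores a set of Likert responses and measures how far the result lands from a requested target. It is a minimal sketch under stated assumptions, not the paper's metric: the function names, item keys, reverse-keyed set, and absolute-error measure are all hypothetical, and the 1 to 5 response range is assumed (it matches BFI-44 but not every inventory).

```python
from statistics import mean


def scale_score(responses, reverse_keyed=frozenset(), low=1, high=5):
    """Mean Likert score for one scale.

    Reverse-keyed items are flipped (low + high - response) before averaging.
    The 1-5 range is an assumption; adjust low/high for other inventories.
    """
    scored = [
        (low + high - r) if item in reverse_keyed else r
        for item, r in responses.items()
    ]
    return mean(scored)


def target_score_error(responses, target, reverse_keyed=frozenset()):
    """Absolute gap between the achieved scale score and the requested target.

    Hypothetical measure: a gap near 0 means the model hit the target score.
    """
    return abs(scale_score(responses, reverse_keyed) - target)


# Hypothetical model responses to three items of one scale.
responses = {"item_1": 4, "item_2": 2, "item_3": 4}
reverse = {"item_2"}  # assumed reverse-keyed item

print(scale_score(responses, reverse))              # achieved scale score
print(target_score_error(responses, 4.5, reverse))  # gap to a target of 4.5
```

A model that consistently drives this gap toward zero across different targets would need to know both the item wording and the scoring key, which is why target score matching is treated as evidence of contamination beyond item memorization alone.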

