AI | CY | Jan 5

Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs

arXiv:2601.01878v1
h-index: 11
Originality: Highly original
AI Analysis

This paper addresses a foundational issue in AI evaluation, relevant to both researchers and practitioners, by diagnosing a systemic validity illusion in socio-cognitive assessments of LLMs.

The paper tackles the problem that socio-cognitive benchmarks for LLMs often fail to predict real-world behavior due to a lack of explicit theoretical grounding, and introduces the Theory Trace Card (TTC) as a documentation artifact to formalize evaluations and enhance interpretability without modifying benchmarks.

Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions linking task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence: a gap that creates a systemic validity illusion by masking the failure to evaluate the capability's other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact designed to accompany socio-cognitive evaluations, which explicitly outlines the theoretical basis of an evaluation, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making explicit the full validity chain, which links theory, task operationalization, scoring, and limitations, without modifying benchmarks or requiring agreement on a single theory.

