AI · May 19

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh

arXiv:2605.19529 · 71.6

Predicted impact: top 48% in AI · last 90 days
Originality: Incremental advance

AI Analysis

The paper targets self-referential validation loops in LLM-based adaptive assessment, a concern for educational measurement researchers and practitioners.

The paper introduces Generative-Evaluative Agreement (GEA), a validity criterion for LLM-enabled adaptive assessments that measures whether an LLM's scoring recovers the skill levels its generated responses were intended to show. On a two-stage adaptive assessment, the model recovered roughly half the intended variance (r = 0.698, so r² ≈ 0.49) with a systematic positive bias. GEA was strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills.

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
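To make the reported quantities concrete, here is a minimal sketch of how a GEA-style summary could be computed. It assumes that agreement is summarized as the Pearson correlation between the skill levels the generator was instructed to produce and the scores the same model assigns, and that bias is the mean signed difference between the two. The abstract does not specify the exact estimator, so both choices, the function names `pearson_r` and `gea_summary`, and the sample data are illustrative assumptions rather than the authors' procedure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def gea_summary(intended, recovered):
    """Summarize agreement between instructed and recovered skill levels.

    intended:  skill levels the generator was told to simulate (assumed numeric)
    recovered: scores the same model assigned to those simulated responses
    """
    r = pearson_r(intended, recovered)
    # Positive mean bias means the scorer overestimates on average.
    mean_bias = sum(s - t for t, s in zip(intended, recovered)) / len(intended)
    return {"r": r, "variance_recovered": r * r, "mean_bias": mean_bias}


if __name__ == "__main__":
    # Made-up illustrative data on a 1-5 scale, not taken from the paper.
    intended = [1, 1, 2, 3, 3, 4, 5, 5]
    recovered = [3, 2, 3, 3, 4, 4, 5, 4]
    print(gea_summary(intended, recovered))
```

Under this reading, the paper's r = 0.698 corresponds to r² ≈ 0.487, which is where "roughly half the intended variance" comes from. Computing the same summary per skill, rather than pooled, would be one way to expose the contrast the abstract describes between syntactically verifiable and design-level skills.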
