Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
The paper addresses self-referential validation loops in LLM-based adaptive assessments, a problem relevant to educational measurement researchers and practitioners.
The paper introduces Generative-Evaluative Agreement (GEA) as a validity criterion for LLM-enabled adaptive assessments, measuring whether an LLM's scoring recovers intended skill levels. In a two-stage adaptive assessment, the model recovered roughly half the intended variance (r = 0.698) with systematic positive bias. GEA was strong for syntactically verifiable skills but near zero for design-level skills.
When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698, so r² ≈ 0.49) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We propose granular, skill-decomposed rubrics as the principal mechanism for strengthening GEA and outline complementary mitigations.
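As a rough illustration of how a GEA check might be computed, the sketch below takes the skill levels the generator was instructed to produce and the levels the scorer recovered, then reports their Pearson correlation, the share of variance recovered (r²), and the mean signed bias. This is a minimal sketch under stated assumptions, not the paper's procedure: the function names `pearson_r` and `gea_summary` are hypothetical, both sets of levels are assumed to sit on the same numeric scale, and the sample values are made up purely to exercise the code.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def gea_summary(intended, recovered):
    """Hypothetical GEA summary.

    Assumes `intended` (levels the generator was told to produce) and
    `recovered` (levels the scorer assigned) share one numeric scale.
    """
    r = pearson_r(intended, recovered)
    # Positive mean_bias means the scorer overestimates on average.
    bias = mean(rec - inten for inten, rec in zip(intended, recovered))
    return {"r": r, "r_squared": r * r, "mean_bias": bias}


if __name__ == "__main__":
    # Made-up illustrative values, not data from the paper.
    intended = [1, 2, 3, 4]
    recovered = [2, 3, 4, 5]
    print(gea_summary(intended, recovered))
```

Reporting the correlation and the bias separately matters because they capture different failures: a scorer can rank simulated students perfectly while still shifting every score upward, which is the pattern that would inflate scores near a routing threshold.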