AI · May 19

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

arXiv:2605.19529 · 71.6
Predicted impact: top 48% in AI · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses self-referential validation loops in LLM-based adaptive assessment, a problem relevant to educational measurement researchers and practitioners.

The paper introduces Generative-Evaluative Agreement (GEA) as a validity criterion for LLM-enabled adaptive assessments: it measures whether an LLM's scoring recovers the skill levels its generation step was instructed to produce. In a two-stage adaptive assessment, the model recovered roughly half the intended variance (r = 0.698) with systematic positive bias. GEA was strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills.

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
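The abstract reports GEA as a correlation between intended and recovered skill levels, together with a directional bias. The sketch below shows one way such a summary could be computed. It is not the authors' implementation: it assumes GEA is the Pearson correlation between the skill level the generator was instructed to produce and the level the scorer assigned, and that bias is the mean signed difference. The function names, the record layout, and the sample numbers are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def gea_summary(intended, scored):
    """Summarise agreement between intended and scored skill levels.

    Assumption (not from the paper): GEA is taken to be Pearson r, and
    bias is the mean of (scored - intended), so positive means overestimation.
    """
    r = pearson_r(intended, scored)
    mean_bias = sum(s - i for i, s in zip(intended, scored)) / len(intended)
    return {"r": r, "r_squared": r * r, "mean_bias": mean_bias}


def gea_by_skill(records):
    """Compute gea_summary separately for each skill.

    `records` is a hypothetical layout: an iterable of
    (skill_name, intended_level, scored_level) tuples.
    """
    grouped = defaultdict(lambda: ([], []))
    for skill, intended, scored in records:
        grouped[skill][0].append(intended)
        grouped[skill][1].append(scored)
    return {skill: gea_summary(i, s) for skill, (i, s) in grouped.items()}


if __name__ == "__main__":
    # Illustrative numbers only; they are not data from the paper.
    records = [
        ("syntax", 1, 2), ("syntax", 2, 3), ("syntax", 3, 3), ("syntax", 4, 4),
        ("design", 1, 3), ("design", 2, 2), ("design", 3, 3), ("design", 4, 2),
    ]
    for skill, summary in gea_by_skill(records).items():
        print(skill, summary)
```

Reporting `r_squared` alongside `r` makes the "roughly half the intended variance" reading explicit, since 0.698 squared is about 0.49. Splitting by skill mirrors the abstract's contrast between syntactically verifiable and design-level skills, where a single pooled correlation would hide the difference.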

