Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

arXiv:2605.06673

Predicted impact: top 74% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For LLM developers and deployers, this work reveals that aggregate metacognitive scores are insufficient for assessing model reliability across knowledge domains, supporting domain-specific screening before deployment.

This study measured domain-level metacognitive monitoring in 33 frontier LLMs using 1,500 MMLU items and found that aggregate metrics mask substantial within-model variation across domains. Applied/Professional knowledge was the easiest domain to monitor (mean AUROC = .742), while Formal Reasoning and Natural Science were the hardest. Within-family profile-shape clustering was significant for Anthropic, Google-Gemini, and Qwen.

Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and computed Type-2 AUROC per model-domain cell using verbalized confidence (0-100). Total observations: 47,151. Every model with above-chance aggregate monitoring showed non-trivial domain-level variation. Applied/Professional knowledge was reliably the easiest benchmark domain to monitor (mean AUROC = .742, ranked top-2 in 21 of 33 models); Formal Reasoning and Natural Science were reliably the hardest (one of the two ranked bottom-2 in 27 of 33 models). The three middle domains were statistically indistinguishable (Kendall's W = .164). A subject-level coherence analysis (within-domain similarity ratio = 0.95) confirms the six-domain grouping is a pragmatic benchmark taxonomy, not a validated latent construct. Within-family profile-shape clustering is significant for Anthropic, Google-Gemini, and Qwen (permutation p < .0001) but not DeepSeek, Google-Gemma, or OpenAI. Gemma 4 31B showed a +.202 AUROC improvement over Gemma 3 27B. Three models classified Invalid on binary KEEP/WITHDRAW probes produced normal profiles under verbalized confidence, confirming probe-format specificity. Bootstrap 95% CIs on 198 cells have median width .199. Split-half aggregate stability r = .893; profile-level split-half is weaker (grand median r = .184). These results show stable benchmark-domain variation obscured by aggregate metrics, and support benchmark-stage domain screening as a step before deployment in specific application areas.
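The per-cell statistic described above can be illustrated with a minimal sketch. It assumes Type-2 AUROC is computed in the usual way, as the probability that a correctly answered item received higher verbalized confidence than an incorrectly answered one (ties counted as one half), and that the confidence intervals come from a percentile bootstrap over items. The function names, the number of resamples, and the toy data are hypothetical and are not taken from the paper.

```python
import random


def type2_auroc(confidence, correct):
    """Type-2 AUROC: P(confidence on a correct item > confidence on an
    incorrect item), with ties counted as 0.5.
    Returns None when the cell has no correct or no incorrect items."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(confidence, correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one model-domain cell (assumed procedure).
    Items are resampled with replacement; degenerate resamples are skipped."""
    rng = random.Random(seed)
    n = len(confidence)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = type2_auroc([confidence[i] for i in idx],
                        [correct[i] for i in idx])
        if a is not None:
            stats.append(a)
    if not stats:
        return None
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[max(0, int((1 - alpha / 2) * len(stats)) - 1)]
    return lo, hi


# Toy cell: verbalized confidence (0-100) and correctness for six items.
conf = [90, 80, 70, 60, 50, 40]
hit = [1, 1, 0, 1, 0, 0]
auroc = type2_auroc(conf, hit)   # 8 of 9 correct/incorrect pairs are ordered correctly
ci = bootstrap_ci(conf, hit)
```

With 250 items per domain, a single cell can be noisy, which is consistent with the wide median CI the abstract reports; a value of .5 corresponds to chance-level monitoring.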
