LG · Dec 11, 2025

Metacognitive Sensitivity for Test-Time Dynamic Model Selection

arXiv:2512.10451v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This work targets unreliable confidence estimates in AI models, a problem for both researchers and practitioners. It offers a new approach to ensemble selection, though the advance is incremental because it builds on existing concepts from cognitive science.

The authors address poor calibration in deep learning models with a metacognitive framework that uses meta-d' to measure how reliably a model's confidence predicts its accuracy. They apply this measure to test-time model selection and report improved joint-inference accuracy over the constituent models across multiple datasets and model combinations.

A key aspect of human cognition is metacognition - the ability to assess one's own knowledge and judgment reliability. While deep learning models can express confidence in their predictions, they often suffer from poor calibration, a cognitive bias where expressed confidence does not reflect true competence. Do models truly know what they know? Drawing from human cognitive science, we propose a new framework for evaluating and leveraging AI metacognition. We introduce meta-d', a psychologically-grounded measure of metacognitive sensitivity, to characterise how reliably a model's confidence predicts its own accuracy. We then use this dynamic sensitivity score as context for a bandit-based arbiter that performs test-time model selection, learning which of several expert models to trust for a given task. Our experiments across multiple datasets and deep learning model combinations (including CNNs and VLMs) demonstrate that this metacognitive approach improves joint-inference accuracy over constituent models. This work provides a novel behavioural account of AI models, recasting ensemble selection as a problem of evaluating both short-term signals (confidence prediction scores) and medium-term traits (metacognitive sensitivity).
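To make "how reliably a model's confidence predicts its own accuracy" concrete, here is a minimal sketch of a simpler stand-in for metacognitive sensitivity: the type-2 AUROC, the probability that a correct prediction receives higher confidence than an incorrect one. This is not meta-d' itself, which requires fitting a signal detection model, and it is not taken from the paper. The function name `type2_auroc` and the two toy models are hypothetical and serve only as illustration.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a randomly chosen correct prediction has higher
    confidence than a randomly chosen incorrect one (ties count as half).

    Assumption: used here as a simple proxy for metacognitive sensitivity,
    not as an implementation of meta-d'.
    """
    hits = [c for c, ok in zip(confidence, correct) if ok]
    misses = [c for c, ok in zip(confidence, correct) if not ok]
    if not hits or not misses:
        raise ValueError("need at least one correct and one incorrect prediction")

    wins = 0.0
    for h in hits:
        for m in misses:
            if h > m:
                wins += 1.0
            elif h == m:
                wins += 0.5
    return wins / (len(hits) * len(misses))


# Two hypothetical models with the same accuracy (3 of 5 correct).
correct = [True, True, True, False, False]
model_a_conf = [0.9, 0.8, 0.7, 0.3, 0.2]  # confidence separates right from wrong
model_b_conf = [0.6, 0.6, 0.6, 0.6, 0.6]  # confidence carries no information

print(type2_auroc(model_a_conf, correct))  # 1.0
print(type2_auroc(model_b_conf, correct))  # 0.5
```

Both toy models are equally accurate, but only model A's confidence says anything about when it is right. A score of this kind, tracked over recent predictions, is the sort of medium-term trait an arbiter could weigh alongside the per-prediction confidence.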

