A Confidence-Diversity Framework for Calibrating AI Judgement in Accessible Qualitative Coding Tasks
This work provides a methodological foundation for calibrating AI judgement in accessible qualitative coding tasks, though the contribution is incremental because it builds on existing LLM capabilities.
The paper tackles the problem of assessing reliability in LLM-based qualitative coding by proposing a confidence-diversity calibration framework, which reduces manual effort by 65% and auto-accepts 35% of segments with less than 5% error.
LLMs enable qualitative coding at large scale, but assessing reliability remains challenging where human experts seldom agree. We investigate confidence-diversity calibration as a quality assessment framework for accessible coding tasks where LLMs already demonstrate strong performance but exhibit overconfidence. Analysing 5,680 coding decisions from eight state-of-the-art LLMs across ten categories, we find that mean self-confidence tracks inter-model agreement closely (Pearson r=0.82). Adding model diversity quantified as normalised Shannon entropy produces a dual signal explaining agreement almost completely (R-squared=0.979), though this high predictive power likely reflects task simplicity for current LLMs. The framework enables a three-tier workflow auto-accepting 35 percent of segments with less than 5 percent error, cutting manual effort by 65 percent. Cross-domain validation confirms transferability (kappa improvements of 0.20 to 0.78). While establishing a methodological foundation for AI judgement calibration, the true potential likely lies in more challenging scenarios where LLMs may demonstrate comparative advantages over human cognitive limitations.
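To make the dual signal concrete, the following is a minimal sketch of how mean self-confidence and normalised Shannon entropy could be combined to route one segment into three tiers. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the normalisation constant, the thresholds, or the names of the lower tiers, so dividing by the log of the number of categories, the cut-offs `accept_conf`, `accept_div`, and `review_div`, and the labels `light_review` and `expert_review` are all hypothetical.

```python
import math
from collections import Counter


def normalised_entropy(labels, n_categories):
    """Shannon entropy of the models' label distribution, scaled to [0, 1].

    Assumption: normalise by log(n_categories); the abstract does not
    state which normalisation the authors use.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    if n_categories < 2:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    # H = sum_k p_k * log(1 / p_k), written so unanimous labels give exactly 0.0
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(n_categories)


def route_segment(labels, confidences, n_categories=10,
                  accept_conf=0.9, accept_div=0.1, review_div=0.5):
    """Assign one segment to a tier from the two signals.

    All three thresholds are hypothetical placeholders; in practice they
    would be tuned on held-out data to meet a target error rate.
    """
    mean_conf = sum(confidences) / len(confidences)
    diversity = normalised_entropy(labels, n_categories)
    if mean_conf >= accept_conf and diversity <= accept_div:
        return "auto_accept"      # high confidence, models agree
    if diversity <= review_div:
        return "light_review"     # hypothetical middle tier
    return "expert_review"        # hypothetical tier for strong disagreement


# Eight models labelling one segment with the same code and high confidence
print(route_segment(["A"] * 8, [0.95] * 8))
```

In this sketch a segment is auto-accepted only when both signals agree, which mirrors the abstract's point that self-confidence alone is unreliable for overconfident models and that diversity across models supplies the second check.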