
Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains

arXiv:2603.00924v1
h-index: 3
Originality: Synthesis-oriented
AI Analysis

This work addresses safe deployment of LLMs in clinical settings by providing domain-specific calibration for medical entity extraction, though it is incremental as it adapts existing conformal prediction methods to new domains.

The study tackled the problem of miscalibrated confidence scores in LLMs for medical entity extraction by applying a conformal prediction framework across two clinical domains, achieving target coverage of at least 90% with manageable rejection rates of 9-13%.

Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds (τ ≈ 0.06), while on free-text radiology reports, models are overconfident, demanding strict thresholds (τ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage (≥ 90%) in both settings with manageable rejection rates (9-13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
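To make the role of the threshold τ concrete, here is a minimal sketch of generic split conformal calibration applied to confidence scores. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the nonconformity score or the calibration recipe. The sketch assumes the score is one minus the model's confidence, that the calibration set holds confidences of entities verified as correct, and that calibration and test entities are exchangeable. The function names `conformal_threshold` and `rejection_rate` are hypothetical.

```python
import math


def conformal_threshold(calib_confidences, alpha=0.10):
    """Pick tau from confidences of verified-correct calibration entities.

    Assumption: nonconformity score = 1 - confidence. The split conformal
    quantile is the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    which is the same as the k-th largest confidence. Under exchangeability,
    a new correct entity then has confidence >= tau with probability at
    least 1 - alpha.
    """
    n = len(calib_confidences)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the conformal quantile
        # is infinite, so the only valid choice is to accept everything.
        return 0.0
    ranked = sorted(calib_confidences, reverse=True)
    return ranked[k - 1]


def rejection_rate(confidences, tau):
    """Fraction of extracted entities whose confidence falls below tau."""
    if not confidences:
        return 0.0
    rejected = sum(1 for c in confidences if c < tau)
    return rejected / len(confidences)


if __name__ == "__main__":
    # Toy calibration confidences (made up for illustration only).
    calib = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.3, 0.05]
    tau = conformal_threshold(calib, alpha=0.2)
    print("tau:", tau)
    print("rejection rate on calibration data:", rejection_rate(calib, tau))
```

Under these assumptions, τ is roughly the α-quantile of confidence among correct entities. A model that reports low confidence on many correct extractions yields a small τ, while a model whose confidences cluster near 1 yields a τ close to 1, which is one way to read the contrast between τ ≈ 0.06 and τ up to 0.99 reported above.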

