CLMar 26, 2025

Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters

Mahmoud Alwakeel, Emory Buck, Jonathan G. Martin, Imran Aslam, Sudarshan Rajagopal, Jian Pei, Mihai V. Podgoreanu, Christopher J. Lindsell, An-Kwok Ian Wong

arXiv:2503.21004v32.7h-index: 2AMAI@MICCAI

Originality Incremental advance

AI Analysis

This addresses the resource-intensive manual abstraction in clinical registries, offering a scalable solution for healthcare researchers, though it is incremental as it applies existing LLMs to a new domain.

The study tackled the problem of automating concept extraction from radiology reports for pulmonary embolism registries, finding that large language models achieved high accuracy, with mean accuracy up to 0.96 for some variants and dual-model concordance maintaining PPV >= 0.90 and NPV >= 0.95 for key concepts.

Pulmonary embolism (PE) registries accelerate practice-improving research but depend on resource-intensive manual abstraction of radiology reports. We evaluated whether openly available large-language models (LLMs) can automate concept extraction from computed-tomography PE (CTPE) reports without sacrificing data quality. Four Llama-3 (L3) variants (3.0 8 B, 3.1 8 B, 3.1 70 B, 3.3 70 B) and two reviewer models Phi-4 (P4) 14 B and Gemma-3 27 B (G3) were tested on 250 dual-annotated CTPE reports each from MIMIC-IV and Duke University. Outcomes were accuracy, positive predictive value (PPV), and negative predictive value (NPV) versus a human gold standard across model sizes, temperature settings, and shot counts. Mean accuracy across all concepts increased with scale: 0.83 (L3-0 8 B), 0.91 (L3-1 8 B), and 0.96 for both 70 B variants; P4 14 B achieved 0.98; G3 matched. Accuracy differed by < 0.03 between datasets, underscoring external robustness. In dual-model concordance analysis (L3 70 B + P4 14 B), PE-presence PPV was >= 0.95 and NPV >= 0.98, while location, thrombus burden, right-heart strain, and image-quality artifacts each maintained PPV >= 0.90 and NPV >= 0.95. Fewer than 4% of individual concept annotations were discordant, and complete agreement was observed in more than 75% of reports. G3 performed comparably. LLMs therefore offer a scalable, accurate solution for PE registry abstraction, and a dual-model review workflow can further safeguard data quality with minimal human oversight.

View on arXiv PDF

Similar