CL · CV · Apr 28, 2025

A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports

arXiv:2504.20220v1 · 3 citations · h-index: 11 · Has Code · EMBC
Originality: Synthesis-oriented
AI Analysis

This work targets the time-consuming, error-prone task of digitizing paper-based clinical data for healthcare administrators. Its contribution is incremental: it applies existing methods to a specific domain.

The study addresses the manual transcription of checkbox data from paper documents such as transfusion reaction reports. It develops an open-source multimodal pipeline built on vision-language models and reports high precision and recall against gold-standard data compiled from 2017 to 2024.

Despite the growing adoption of electronic health records, many processes still rely on paper documents, reflecting the heterogeneous real-world conditions in which healthcare is delivered. Manually transcribing paper-based data into digital formats is time-consuming and prone to errors. To streamline this workflow, this study presents an open-source pipeline that extracts and categorizes checkbox data from scanned documents. Although it is demonstrated on transfusion reaction reports, the design supports adaptation to other checkbox-rich document types. The proposed method integrates checkbox detection, multilingual optical character recognition (OCR), and multilingual vision-language models (VLMs). The pipeline achieves high precision and recall when compared against annually compiled gold standards from 2017 to 2024. The result is a reduction in administrative workload together with accurate regulatory reporting. The open-source availability of this pipeline encourages self-hosted parsing of checkbox forms.
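The evaluation described above, precision and recall against a gold standard, can be illustrated with a minimal sketch. It assumes each extracted checkbox is represented as a (report ID, checkbox label) pair; that representation, the function name `precision_recall`, and the sample labels are hypothetical and are not taken from the paper or its code.

```python
from typing import Iterable, Tuple

# Hypothetical representation: one ticked checkbox = (report_id, checkbox_label).
Checkbox = Tuple[str, str]


def precision_recall(
    predicted: Iterable[Checkbox], gold: Iterable[Checkbox]
) -> Tuple[float, float]:
    """Compare ticked checkboxes found by a pipeline with a gold standard."""
    pred_set, gold_set = set(predicted), set(gold)
    # True positives: checkboxes marked as ticked in both sets.
    true_positives = len(pred_set & gold_set)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Illustrative data only; the labels below are made up for this example.
gold = {("r1", "fever"), ("r1", "chills"), ("r2", "urticaria")}
predicted = {("r1", "fever"), ("r2", "urticaria"), ("r2", "dyspnea")}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of three predicted checkboxes are correct and two of three gold checkboxes are found, so both metrics come out at two thirds. Treating each ticked box as a set element keeps the comparison independent of the order in which boxes are detected on the page.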
