Automated Cardiovascular Record Retrieval by Multimodal Learning between Electrocardiogram and Clinical Report
This work addresses the shortage of clinical ECG interpretation resources, particularly in underdeveloped regions, by offering a retrieval-based alternative to classification.
The paper tackles automated ECG interpretation with a multimodal learning approach that retrieves similar clinical cases from ECG data. It processes ECGs as encoded images and aligns them with diagnostic reports using vision-language models, which could help provide diagnostic services in underdeveloped regions.
Automated interpretation of electrocardiograms (ECGs) has garnered significant attention with advances in machine learning. Despite this growing interest, most current studies focus solely on classification or regression tasks and overlook a crucial aspect of clinical cardiovascular disease diagnosis: the diagnostic report written by experienced clinicians. In this paper, we introduce a novel approach to ECG interpretation that leverages recent breakthroughs in Large Language Models (LLMs) and Vision Transformer (ViT) models. Rather than treating ECG diagnosis as a classification or regression task, we propose automatically identifying the most similar clinical cases for a given input ECG. In addition, because interpreting ECGs as images is more affordable and accessible, we process ECGs as encoded images and adopt a vision-language learning paradigm to jointly learn the alignment between encoded ECG images and ECG diagnostic reports. Encoding ECGs as images can yield an efficient ECG retrieval system that would be highly practical in clinical applications. More importantly, our findings could serve as a crucial resource for providing diagnostic services in underdeveloped regions.
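To make the retrieval formulation concrete, the following is a minimal sketch of the final lookup step only. It assumes that an ECG image encoder and a report encoder have already mapped their inputs into a shared embedding space, and it ranks stored reports by cosine similarity to the query ECG embedding. The function names, the toy three-dimensional vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, report_embeddings, k=3):
    """Return the k (report_id, score) pairs most similar to the query, best first.

    query_embedding: embedding of the encoded ECG image (assumed precomputed).
    report_embeddings: dict mapping a hypothetical report id to its embedding.
    """
    scored = [
        (report_id, cosine_similarity(query_embedding, embedding))
        for report_id, embedding in report_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: made-up embeddings standing in for encoder outputs.
reports = {
    "report_a": [0.9, 0.1, 0.0],
    "report_b": [0.1, 0.9, 0.2],
    "report_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

top = retrieve_top_k(query, reports, k=2)
for report_id, score in top:
    print(report_id, round(score, 3))
```

In a real system the linear scan would typically be replaced by an approximate nearest-neighbor index once the report collection grows large; the ranking logic stays the same.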