CV, LG | Oct 11, 2023

IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

Che Liu, Sibo Cheng, Miaojing Shi, Anand Shah, Wenjia Bai, Rossella Arcucci

arXiv:2310.07355v5 | 15.741 citations | h-index: 30 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific bottleneck in medical AI by improving vision-language alignment for tasks such as chest X-ray analysis. The advance is incremental: it builds on existing VLP methods and adds structured report integration.

The paper addresses a gap in medical vision-language pre-training: existing methods overlook the hierarchical structure of clinical reports. It proposes IMITATE, a framework that aligns multi-level visual features with the descriptive and conclusive text of a report, and it reports outperforming baseline methods across six datasets and five downstream tasks.

In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into 'findings' for descriptive content and 'impressions' for conclusive observations. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structural information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge when formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment. The code related to this paper is available at https://github.com/cheliu-computation/IMITATE-TMI2024.
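The abstract describes two ideas that a small example can make concrete: aligning different levels of visual features with different report sections, and a contrastive loss whose targets reflect correlations between samples. The NumPy sketch below is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). In particular, pairing mid-level features with 'findings' and high-level features with 'impressions', the use of report-to-report similarity as a stand-in for the clinical prior, and all function names, temperatures, and the `smoothing` weight are assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def soft_targets(text_emb, smoothing=0.1, temperature=0.1):
    """Blend one-hot targets with report-to-report similarity.

    ASSUMPTION: report similarity is a hypothetical stand-in for the
    clinical prior mentioned in the abstract. Each row sums to 1.
    """
    n = text_emb.shape[0]
    identity = np.eye(n)
    t = l2_normalize(text_emb)
    prior = np.exp(log_softmax(t @ t.T / temperature))
    return (1.0 - smoothing) * identity + smoothing * prior


def contrastive_loss(img_emb, txt_emb, targets, temperature=0.07):
    """Symmetric cross-entropy between similarity logits and soft targets."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature
    # Image-to-text and text-to-image directions share the same row-normalized targets.
    loss_i2t = -(targets * log_softmax(logits)).sum(axis=1).mean()
    loss_t2i = -(targets * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


def hierarchical_loss(mid_feat, high_feat, findings_emb, impressions_emb,
                      smoothing=0.1):
    """Sum of two alignment terms, one per report section.

    ASSUMPTION: mid-level features pair with 'findings' and high-level
    features pair with 'impressions'; the abstract only says the levels
    are aligned separately with descriptive and conclusive text.
    """
    loss_findings = contrastive_loss(
        mid_feat, findings_emb, soft_targets(findings_emb, smoothing))
    loss_impressions = contrastive_loss(
        high_feat, impressions_emb, soft_targets(impressions_emb, smoothing))
    return loss_findings + loss_impressions
```

With all inputs shaped `(batch, dim)`, `hierarchical_loss` returns a single scalar. Setting `smoothing=0.0` reduces each term to a standard symmetric InfoNCE loss with one-hot targets, which makes the effect of the soft targets easy to compare.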

