CV · AI · CL · Aug 10, 2021

BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

arXiv:2108.04938v1 · 46 citations
Originality: Incremental advance
AI Analysis

This work addresses the domain gap in medical imaging, offering clinicians a more data-efficient solution for thoracic disease diagnosis.

The paper tackles the challenge of applying vision-and-language models to medical data by proposing BERTHop, which improves disease diagnosis on chest X-rays. On the OpenI dataset it reaches an average AUC of 98.12%, 1.62% higher than the previous SOTA, while training on a dataset 9 times smaller.

Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve the model performance for downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the domain gap. In this paper, we investigate the challenges of applying pre-trained V&L models in medical applications. In particular, we identify that the visual representation in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT, for better capturing the associations between the two modalities. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, which is 1.62% higher than state-of-the-art (SOTA), while it is trained on a 9 times smaller dataset.
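To make the "average AUC" figure concrete, here is a minimal sketch of one common way such a number is obtained for multi-label diagnosis: compute a binary AUC per finding and take the unweighted mean. This is an assumption about the metric, not the paper's confirmed evaluation code; the function names and the toy data are hypothetical and exist only for illustration.

```python
from typing import Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix: Sequence[Sequence[int]],
              score_matrix: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of per-label AUCs (rows: studies, columns: findings)."""
    n_labels = len(label_matrix[0])
    per_label = [
        binary_auc([row[j] for row in label_matrix],
                   [row[j] for row in score_matrix])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Hypothetical toy data: 4 studies, 2 findings (not taken from OpenI).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]

if __name__ == "__main__":
    print(macro_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of studies, which is fine for a small illustration; a rank-based implementation would be the usual choice at scale.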

Code Implementations: 1 repo