CV | CL | Apr 11, 2023

ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity

arXiv:2304.05303v2 | 1 citation | h-index: 12
Originality: Incremental advance
AI Analysis

This work addresses the need for accurate localization in computer-aided diagnosis for radiologists, representing an incremental improvement over prior locality-aware methods.

The paper tackles the problem of preserving spatial relationships in visual language pre-training for chest X-ray analysis, proposing ELVIS, which significantly outperforms state-of-the-art baselines in segmentation and phrase grounding tasks.

Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its need for expensive annotations for improving performance prevents widespread clinical application. Visual language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging routinely generated reports for radiographs, which exist in large quantities as well as in paired form (image-text pairs). Additionally, extensions to localization-aware VLPs are being proposed to address the needs for accurate localization of abnormalities for computer-aided diagnosis (CAD) in CXR. However, we find that the formulation proposed by locality-aware VLP literature actually leads to a loss in spatial relationships required for downstream localization tasks. Therefore, we propose Empowering Locality of VLP with Intra-modal Similarity, ELVIS, a VLP aware of intra-modal locality, to better preserve the locality within radiographs or reports, which enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines in multiple segmentation tasks and the MS-CXR phrase grounding task. Qualitatively, we show that ELVIS focuses well on regions of interest described in the report text compared to prior approaches, allowing for enhanced interpretability.
