CV | AI | Jun 7, 2023

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation

arXiv:2306.04811v1 | 61 citations | h-index: 49
Originality: Incremental advance
AI Analysis

This work addresses the scarcity of paired text data for 3D medical image segmentation. It is an incremental advance, since it extends existing VLP methods to a new domain.

The paper tackles the challenge of applying Vision-Language Pretraining (VLP) to 3D medical images without paired text. It generates synthetic text with large language models and trains with a negative-free contrastive objective, and it reports superior performance across 13 datasets covering CT, MRI, and EM segmentation tasks.

Vision-Language Pretraining (VLP) has demonstrated remarkable capabilities in learning visual representations from textual descriptions of images without annotations. Yet, effective VLP demands large-scale image-text pairs, a resource that is scarce in the medical domain. Moreover, conventional VLP is limited to 2D images, while medical images encompass diverse modalities, often in 3D, making the learning process more challenging. To address these challenges, we present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM), a framework that extends VLP to 3D medical images without relying on paired textual descriptions. Specifically, GTGM utilizes large language models (LLMs) to generate medical-style text from 3D medical images. This synthetic text is then used to supervise 3D visual representation learning. Furthermore, a negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches, which effectively mitigates the biases associated with strict positive-negative sample pairings. We evaluate GTGM on three imaging modalities, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and electron microscopy (EM), over 13 datasets. GTGM's superior performance across various medical image segmentation tasks underscores its effectiveness and versatility in extending VLP to 3D medical imagery while bypassing the need for paired text.
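As a rough illustration of what a negative-free objective between two augmented views can look like, here is a minimal sketch. The abstract does not give the exact form of GTGM's loss, so the formulation below (a BYOL/SimSiam-style `2 - 2 * cosine similarity` over matched pairs) is an assumption, and the names `cosine_similarity` and `negative_free_loss` are hypothetical. It only computes the loss value on plain lists of floats; in a real training setup the embeddings would come from a 3D encoder applied to augmented patches.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_free_loss(view_a, view_b):
    """Average alignment loss over matched embedding pairs.

    Assumed form (not taken from the paper): 2 - 2 * cos(z_a, z_b).
    Only positive pairs (two augmentations of the same patch) are used;
    no other sample in the batch is treated as a negative.
    """
    if len(view_a) != len(view_b):
        raise ValueError("both views must contain the same number of embeddings")
    total = 0.0
    for z_a, z_b in zip(view_a, view_b):
        total += 2.0 - 2.0 * cosine_similarity(z_a, z_b)
    return total / len(view_a)


# Hypothetical embeddings of two augmentations of the same two patches.
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_b = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = negative_free_loss(view_a, view_b)
```

Under this assumed form the loss is 0 when the two views of every patch point in the same direction and grows to 4 when they point in opposite directions. Because no term pushes different patches apart, the objective avoids labelling anatomically similar patches as negatives, which is the bias the abstract refers to.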

