CV · Mar 2, 2023

ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax

Zachary Huemann, Xin Tie, Junjie Hu, Tyler J. Bradshaw

arXiv:2303.01615v210.432 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the problem of improving pneumothorax segmentation accuracy on chest radiographs by leveraging multimodal data, namely the image together with its radiology report. The advance is incremental because it builds on existing vision-language methods.

The paper tackles pneumothorax segmentation on chest radiographs by proposing ConTEXTual Net, a vision-language model that uses radiology report text to guide segmentation. It achieved a Dice score of 0.716±0.016, comparable to inter-reader variability (0.712±0.044), and outperformed the ResNet50 U-Net, GLoRIA, and LAVT baselines.
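For readers unfamiliar with the metric, the Dice score measures the overlap between a predicted mask and a reference mask as 2|A∩B| / (|A| + |B|). The following is a minimal, dependency-free sketch of that formula. The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled positive in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)

    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    # One shared positive pixel, three positive pixels in total: 2 * 1 / 3.
    print(round(dice_score(predicted, reference), 4))
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no positive pixels, which is why agreement between two human readers can be reported on the same scale as model performance.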

Radiology narrative reports often describe characteristics of a patient's disease, including its location, size, and shape. Motivated by the recent success of multimodal learning, we hypothesized that this descriptive text could guide medical image analysis algorithms. We proposed a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs. ConTEXTual Net utilizes language features extracted from corresponding free-form radiology reports using a pre-trained language model. Cross-attention modules are designed to combine the intermediate output of each vision encoder layer and the text embeddings generated by the language model. ConTEXTual Net was trained on the CANDID-PTX dataset consisting of 3,196 positive cases of pneumothorax with segmentation annotations from 6 different physicians as well as clinical radiology reports. Using cross-validation, ConTEXTual Net achieved a Dice score of 0.716±0.016, which was similar to the degree of inter-reader variability (0.712±0.044) computed on a subset of the data. It outperformed both vision-only models (ResNet50 U-Net: 0.677±0.015 and GLoRIA: 0.686±0.014) and a competing vision-language model (LAVT: 0.706±0.009). Ablation studies confirmed that it was the text information that led to the performance gains. Additionally, we show that certain augmentation methods degraded ConTEXTual Net's segmentation performance by breaking the image-text concordance. We also evaluated the effects of using different language models and activation functions in the cross-attention module, highlighting the efficacy of our chosen architectural design.
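The abstract describes cross-attention modules that combine vision encoder outputs with text embeddings. As a rough illustration of the general mechanism only, the sketch below lets each image feature vector (the query) attend over the text token embeddings (the keys and values) using scaled dot-product attention. It is not the authors' implementation: the function names, the omission of learned projection matrices, the absence of the activation function mentioned in the abstract, and the residual addition in `fuse` are all simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(image_feats, text_embeds):
    """Scaled dot-product cross-attention (hypothetical, projection-free sketch).

    image_feats: list of N vectors, one per spatial position (queries).
    text_embeds: list of T vectors, one per report token (keys and values).
    Both must share the same dimension d. Returns N text-context vectors.
    """
    dim = len(text_embeds[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in image_feats:
        # How strongly this image position attends to each text token.
        weights = softmax([dot(query, key) * scale for key in text_embeds])
        # Weighted average of the text embeddings.
        context = [
            sum(w * value[j] for w, value in zip(weights, text_embeds))
            for j in range(dim)
        ]
        attended.append(context)
    return attended


def fuse(image_feats, text_embeds):
    """Add the text context back onto the image features (assumed residual fusion)."""
    attended = cross_attention(image_feats, text_embeds)
    return [
        [x + c for x, c in zip(feat, ctx)]
        for feat, ctx in zip(image_feats, attended)
    ]


if __name__ == "__main__":
    image_feats = [[1.0, 0.0], [0.0, 1.0]]               # two spatial positions
    text_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three report tokens
    for row in fuse(image_feats, text_embeds):
        print([round(v, 3) for v in row])
```

In a trained model the queries, keys, and values would pass through learned linear projections and the module would be applied at each encoder stage; the sketch keeps only the attention arithmetic so the data flow is easy to follow.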
