CV, AI | Oct 14, 2025

A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

Shurong Chai, Rahul Kumar JAIN, Rui Xu, Shaocong Mo, Ruibo Hou, Shiyu Teng, Jiaqing Liu, Lanfen Lin, Yen-Wei Chen

arXiv:2510.12482v13.61 citations | h-index: 11 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses a specific bottleneck in multimodal medical image segmentation that is relevant to researchers and practitioners, though the advance is incremental.

The paper addresses a problem in referring medical image segmentation: data augmentation disrupts the spatial alignment between text and images. It proposes an early fusion framework that combines text and visual features before augmentation, and reports state-of-the-art results across three medical imaging tasks and four segmentation frameworks.

Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.
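The alignment problem described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). The function `text_to_pseudo_image` is a hypothetical, rule-based stand-in for the learned lightweight generator, and the names `early_fuse` and `hflip`, the 4x4 image, and the overlap score are all assumptions made for illustration. The sketch compares two orderings: fusing text and image before a horizontal flip, and flipping the image alone while leaving the text untouched.

```python
import numpy as np


def text_to_pseudo_image(text, shape):
    """Hypothetical stand-in for a learned text-to-visual generator.

    Maps a location word in the text to a coarse spatial map.
    """
    h, w = shape
    pseudo = np.zeros((h, w), dtype=np.float32)
    if "left" in text:
        pseudo[:, : w // 2] = 1.0
    elif "right" in text:
        pseudo[:, w // 2 :] = 1.0
    return pseudo


def early_fuse(image, text):
    """Stack the image and the text-derived pseudo-image as channels."""
    pseudo = text_to_pseudo_image(text, image.shape)
    return np.stack([image, pseudo], axis=0)  # shape (2, H, W)


def hflip(x):
    """Horizontal flip along the last (width) axis."""
    return x[..., ::-1]


# Toy sample: a 2x2 "lesion" in the left half of a 4x4 image.
image = np.zeros((4, 4), dtype=np.float32)
image[1:3, 0:2] = 1.0
mask = image.copy()
text = "lesion in the left half"

# Early fusion: fuse first, then flip the fused tensor and the mask together.
fused_aug = hflip(early_fuse(image, text))
mask_aug = hflip(mask)
early_overlap = float((fused_aug[1] * mask_aug).sum())

# Augment first: the image is flipped, but the text still says "left".
late = early_fuse(hflip(image), text)
late_overlap = float((late[1] * mask_aug).sum())

print(early_overlap, late_overlap)
```

In this toy setting the text-derived channel is flipped together with the image under early fusion, so it still covers the flipped lesion. When the image is flipped on its own, the unchanged text points to the wrong half.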
