CV | Jul 16, 2025

Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

arXiv:2507.12236v1 | 8.42 citations | h-index: 7 | Has Code

Originality: Highly original

AI Analysis

This work addresses disease localization in medical imaging using phrases from clinical reports, and presents generative models as a more effective paradigm for interpretable applications in clinical practice.

The paper tackles phrase grounding in medical imaging. It shows that generative text-to-image diffusion models, read out through their cross-attention maps, outperform current discriminative methods, with mIoU scores double those of existing approaches. A novel post-processing technique, Bimodal Bias Merging, improves performance further.

Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
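For readers unfamiliar with the mIoU metric cited in the abstract, the following is a minimal sketch of how a mean intersection-over-union score could be computed from attention heatmaps and ground-truth masks. It is not taken from the paper or its repository: the function names, the min-max normalization, and the 0.5 threshold are all illustrative assumptions, and the authors' evaluation protocol may differ.

```python
# Illustrative sketch only: names, normalization, and threshold are assumptions,
# not the paper's evaluation code.

def binarize(attn_map, threshold=0.5):
    """Min-max normalize a 2D heatmap to [0, 1] and threshold it into a binary mask."""
    flat = [v for row in attn_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no localization signal; return an empty mask.
        return [[False] * len(row) for row in attn_map]
    return [[(v - lo) / (hi - lo) >= threshold for v in row] for row in attn_map]


def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            inter += bool(p) and bool(g)
            union += bool(p) or bool(g)
    return inter / union if union else 0.0


def mean_iou(attn_maps, gt_masks, threshold=0.5):
    """Average IoU over (heatmap, ground-truth mask) pairs."""
    if not attn_maps:
        raise ValueError("mean_iou needs at least one heatmap")
    scores = [iou(binarize(a, threshold), g) for a, g in zip(attn_maps, gt_masks)]
    return sum(scores) / len(scores)


# Toy 4x4 example: the heatmap covers the 2x2 target box plus one stray pixel.
attn = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
gt = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
score = mean_iou([attn], [gt])  # intersection 4, union 5
```

In the toy example the thresholded heatmap overlaps all four target pixels and adds one false positive, so the intersection is 4 and the union is 5. In practice the threshold choice strongly affects the score, which is one reason to treat the fixed 0.5 value here as a placeholder.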
