CV | Jun 12, 2025

Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models

arXiv:2506.10633v1 | 1 citation | h-index: 50 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in medical imaging by improving multi-modal alignment between radiology reports and chest X-ray scans. The advance is incremental, since it builds on existing latent diffusion models.

The authors tackle poor alignment between free-text radiology reports and chest X-ray scans in latent diffusion models. They propose a fine-tuning framework that achieves state-of-the-art results on the MS-CXR benchmark and robust performance on out-of-distribution data (VinDr-CXR).

Latent Diffusion Models have shown remarkable results in text-guided image synthesis in recent years. In the domain of natural (RGB) images, recent works have shown that such models can be adapted to various vision-language downstream tasks with little to no supervision involved. In contrast, text-to-image Latent Diffusion Models remain relatively underexplored in the field of medical imaging, primarily due to limited data availability (e.g., due to privacy concerns). In this work, focusing on the chest X-ray modality, we first demonstrate that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. Then, to alleviate this issue, we propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding. Our method sets a new state-of-the-art on a standard benchmark dataset (MS-CXR), while also exhibiting robust performance on out-of-distribution data (VinDr-CXR). Our code will be made publicly available.

