CV · Mar 11

Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

arXiv:2603.10519v116.1 · h-index: 4 · Has Code
Predicted impact: top 54% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses data scarcity and privacy issues in medical imaging by enhancing controllability in image generation, though it is incremental as it builds on existing text-to-image and disentanglement techniques.

The paper tackles the challenge of fine-tuning text-to-image models for medical image synthesis by addressing the modality gap and the semantic entanglement between visual details and clinical text. The resulting method outperforms existing approaches in generation quality and improves downstream classification performance on three datasets.

Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists: coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, which weakens controllability during generation. To address these issues, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
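
The abstract does not specify how the HFFM is built, so the following is only a rough illustration of what "separated channels" could mean in practice, not the authors' implementation. It assumes the disentangled text yields two token sets (here called anatomy_tokens and style_tokens, both hypothetical names) and that each conditions the image tokens through its own attention pass with its own gate. The function names, shapes, gates, and the omission of learned projections are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Single-head scaled dot-product attention.
    # Learned Q/K/V projections are omitted (an assumption, for brevity).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def fuse_separated(image_tokens, anatomy_tokens, style_tokens,
                   gate_anatomy=1.0, gate_style=1.0):
    # Hypothetical fusion: each semantic factor gets its own attention
    # pass, so one factor can be scaled or swapped without touching
    # the other. The results are added residually to the image tokens.
    anatomy_update = cross_attention(image_tokens, anatomy_tokens)
    style_update = cross_attention(image_tokens, style_tokens)
    return (image_tokens
            + gate_anatomy * anatomy_update
            + gate_style * style_update)


# Toy shapes: 16 image tokens, 4 anatomy tokens, 3 style tokens, width 32.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 32))
anatomy_tokens = rng.normal(size=(4, 32))
style_tokens = rng.normal(size=(3, 32))

fused = fuse_separated(image_tokens, anatomy_tokens, style_tokens)
```

In this sketch, setting gate_style to 0 makes the output independent of style_tokens, which is the kind of per-factor control that separate conditioning paths are meant to allow. The released code at the URL above is the reference for the actual module.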

