CV · Jun 25, 2025

Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation

arXiv:2506.20449v1 · 4 citations · h-index: 3 · DGM4MICCAI@MICCAI
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of medical image generation for healthcare applications, but the advance is incremental: it builds on existing models with specific adaptations.

The paper tackles the problem of generating medical images from text with limited data by proposing Med-Art, a framework that adapts a pre-trained diffusion transformer and uses a hybrid-level fine-tuning method. It reports state-of-the-art performance on two datasets, as measured by FID, KID, and classification metrics.

Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application to medical image generation still faces significant challenges, including small dataset sizes and the scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-$\alpha$, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses and effectively addresses issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.
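For readers unfamiliar with KID (Kernel Inception Distance), one of the metrics named above, the following is a minimal sketch of the standard unbiased estimator with the usual cubic polynomial kernel. It is not taken from the paper: the function names `polynomial_kernel` and `kid_score` are hypothetical, and random vectors stand in for the Inception features that a real evaluation would extract from real and generated images. Practical implementations typically also average the estimate over several random subsets, which this sketch omits.

```python
import numpy as np


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Cubic polynomial kernel k(x, y) = (x.y / d + coef0) ** degree."""
    d = x.shape[1]  # feature dimension
    return (x @ y.T / d + coef0) ** degree


def kid_score(real_feats, gen_feats):
    """Unbiased squared-MMD estimate between two feature sets (hypothetical helper)."""
    m, n = len(real_feats), len(gen_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_gg = polynomial_kernel(gen_feats, gen_feats)
    k_rg = polynomial_kernel(real_feats, gen_feats)
    # Within-set terms exclude the diagonal so the estimator is unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    # The cross term averages over all real/generated pairs.
    term_rg = k_rg.mean()
    return term_rr + term_gg - 2.0 * term_rg


# Assumption: random vectors stand in for Inception features.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))              # same distribution as `real`
shifted = rng.normal(loc=1.0, size=(200, 16))  # mean-shifted distribution

print("KID(real, same)    =", kid_score(real, same))
print("KID(real, shifted) =", kid_score(real, shifted))
```

Lower values indicate that the generated features are closer in distribution to the real ones. In this toy setup the mean-shifted set should score noticeably higher than the set drawn from the same distribution.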

