CV | AI | Jun 30, 2025

VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

arXiv:2506.23641v12 citations | h-index: 15 | MICCAI
Originality: Incremental advance
AI Analysis

This work addresses the challenge of generating detailed medical images for healthcare applications. It is an incremental advance, since it builds on existing diffusion models and MLLMs.

The paper tackles the problem of generating realistic and diverse medical images when detailed attribute descriptions are missing. Its framework, VAP-Diffusion, uses pre-trained Multi-modal Large Language Models to enrich prompts. The authors report improved quality and diversity on three medical imaging types across four datasets.

As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thought for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are used during training and stored by category. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combinations of descriptions at test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.
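The abstract mentions two mechanisms that a small example may make more concrete: per-category storage with random retrieval of descriptions, and a constraint that keeps test embeddings close to training ones. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. `DescriptionBank`, `pull_toward_prototype`, the blending weight `alpha`, and the choice of cosine similarity are all hypothetical; the abstract does not specify how the Prototype Condition Mechanism is actually computed.

```python
import math
import random
from collections import defaultdict


class DescriptionBank:
    """Hypothetical store of MLLM-generated descriptions, keyed by category."""

    def __init__(self, seed=0):
        self._by_category = defaultdict(list)
        self._rng = random.Random(seed)

    def add(self, category, description):
        # Training time: keep each generated description under its category.
        self._by_category[category].append(description)

    def sample(self, category):
        # Test time: draw one stored description at random for the category.
        candidates = self._by_category.get(category)
        if not candidates:
            raise KeyError(f"no descriptions stored for category {category!r}")
        return self._rng.choice(candidates)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pull_toward_prototype(embedding, prototypes, alpha=0.5):
    """Blend a test embedding with its most similar training prototype.

    ASSUMPTION: this nearest-prototype blend is a stand-in for the idea of
    restricting test embeddings to resemble training ones; it is not taken
    from the paper. alpha=0 leaves the embedding unchanged, alpha=1 replaces it.
    """
    nearest = max(prototypes, key=lambda p: cosine(embedding, p))
    return [(1 - alpha) * e + alpha * p for e, p in zip(embedding, nearest)]
```

In this sketch, a call such as `bank.sample("melanoma")` would supply the text condition at inference, and `pull_toward_prototype` would be applied to its embedding before it is passed to the generator. The category name and the embedding step are placeholders for whatever the real pipeline uses.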

