CV AI CL LG | Dec 9, 2024

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

Kangyu Zhu, Peng Xia, Yun Li, Hongtu Zhu, Sheng Wang, Huaxiu Yao

arXiv:2412.06141v4 | 17.327 citations | h-index: 15 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work addresses modality misalignment in medical AI systems, which is crucial for improving diagnostic reliability and reducing errors in clinical settings, though it is incremental as it builds on existing preference optimization methods.

The paper tackles factuality problems in Medical Large Vision-Language Models (Med-LVLMs) caused by modality misalignment, where models prioritize textual knowledge over visual input and hallucinate content that contradicts the image. It proposes MMedPO, a multimodal preference optimization approach that weights preference samples by their clinical relevance. The reported average improvements over existing preference optimization methods are 14.2% on Med-VQA and 51.7% on report generation.

The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have not adequately accounted for clinical relevance in preference data, which makes the resulting samples easy to distinguish and reduces alignment effectiveness. To address this challenge, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations, injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect, achieved through local lesion-noising that disrupts visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, and integrate these scores into the preference optimization process as weights, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving average improvements over existing preference optimization methods of 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code is available at https://github.com/aiming-lab/MMedPO.
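The two mechanisms named in the abstract, local lesion-noising and score-weighted preference optimization, can be illustrated with a minimal sketch. It is not taken from the MMedPO repository. The function names (`noise_region`, `weighted_dpo_loss`), the Gaussian noise model, the bounding-box format, and the choice of a DPO-style objective whose per-sample loss is multiplied by a relevance weight are all assumptions made for illustration. The abstract only states that clinical relevance scores are integrated "as weights".

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def noise_region(image, box, sigma=0.5, seed=0):
    """Return a copy of `image` with Gaussian noise added inside `box`.

    Hypothetical helper: `image` is a list of rows of floats in [0, 1],
    and `box` is (top, left, bottom, right) with exclusive bottom/right.
    """
    rng = random.Random(seed)
    top, left, bottom, right = box
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, bottom):
        for c in range(left, right):
            value = noisy[r][c] + rng.gauss(0.0, sigma)
            noisy[r][c] = min(1.0, max(0.0, value))  # clamp to [0, 1]
    return noisy


def weighted_dpo_loss(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected, weights, beta=0.1):
    """Mean DPO-style loss with each sample scaled by a relevance weight.

    Inputs are per-sample sequence log-probabilities under the policy and
    a frozen reference model. `weights` stands in for normalized clinical
    relevance scores (an assumption about how the scores are applied).
    """
    total = 0.0
    for pc, pr, rc, rr, w in zip(policy_chosen, policy_rejected,
                                 ref_chosen, ref_rejected, weights):
        margin = beta * ((pc - rc) - (pr - rr))
        total += -w * log_sigmoid(margin)
    return total / len(weights)
```

Under this sketch, a preference pair with a low relevance weight contributes proportionally less to the loss, so the model is pushed hardest on pairs judged clinically meaningful. The noised image would serve as the visual input for a dispreferred sample.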

