CV · Mar 19

Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

arXiv:2603.19482 · 72.4 · h-index: 8

AI Analysis

This work addresses the difficulty of constructing high-quality instruction datasets for medical AI applications and offers a more efficient method for domain-specific fine-tuning.

The paper tackles the challenge of fine-tuning large vision language models in the medical domain without curated instruction datasets. It proposes an instruction-free tuning approach that uses only image-description pairs together with a momentum proxy instruction, and it reports state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing fine-tuning efficiency.

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
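The abstract names two mechanisms, a momentum proxy instruction and response shuffling, without giving their exact form. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes "momentum" means an exponential moving average over an instruction embedding, and that shuffling operates on whole sentences of the image description. The function names `ema_update` and `shuffle_response`, the `momentum` value, and the example description are all hypothetical.

```python
import random
import re


def ema_update(proxy, target, momentum=0.99):
    """Move each component of the proxy vector a small step toward the target.

    ASSUMPTION: the paper's "momentum proxy instruction" is modeled here as an
    exponential moving average; the abstract does not specify the update rule.
    """
    return [momentum * p + (1.0 - momentum) * t for p, t in zip(proxy, target)]


def shuffle_response(description, rng):
    """Return the description with its sentences in a random order.

    ASSUMPTION: shuffling is done at sentence granularity; the abstract only
    says that shuffling reduces over-reliance on previous words.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", description) if s.strip()]
    rng.shuffle(sentences)
    return " ".join(sentences)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical image description, used only to exercise the functions.
    text = "The lesion is brown. The border is irregular. Scale is present."
    print(shuffle_response(text, rng))
    print(ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9))
```

With a high `momentum`, the proxy changes slowly between steps, which is one plausible reading of how a proxy could stay close to what the pre-trained model already expects; the paper itself should be consulted for the actual formulation.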
