STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering
This work improves data efficiency for medical AI practitioners by reducing reliance on expensive labeled datasets. The contribution is incremental, since it builds on existing LVLM and DPO techniques.
The paper addresses the high cost and labor of building high-quality visual instruction data for medical image understanding. It introduces STLLaVA-Med, a self-training method that trains an LVLM to auto-generate medical visual instruction data, guided by Direct Preference Optimization (DPO), and reports competitive zero-shot performance on three medical VQA benchmarks using only 9% of the medical data.
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data scarcity, we introduce Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method trains a policy model (an LVLM) to auto-generate medical visual instruction data, improving data efficiency, with training guided by Direct Preference Optimization (DPO). Specifically, a larger and more powerful LVLM (e.g., GPT-4o) acts as a biomedical expert that oversees DPO fine-tuning on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med on three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance while using only 9% of the medical data.
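To make the DPO step concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' implementation. The function name `dpo_loss`, its parameter names, and the value `beta=0.1` are illustrative assumptions; the abstract does not state the hyperparameters used. The sketch assumes that the "chosen" and "rejected" answers have already been labeled (in the paper, by the larger expert LVLM) and that each answer's summed log-probability under the policy and under a frozen reference model is available.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full answer.
    `beta` (assumed value) scales how strongly the policy is pushed
    away from the reference model.
    """
    # How much more the policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy equals reference
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favors the chosen answer
print(neutral, aligned)
```

When the policy and the reference agree, the loss equals log 2. It falls below that value as the policy assigns relatively more probability to the preferred answer, which is the direction fine-tuning pushes the model.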