Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning
This work addresses accuracy issues in medical AI applications, but its contribution is incremental: it builds on existing methods and adds new tuning techniques on top of them.
The paper tackles suboptimal attention distribution in Medical Large Vision-Language Models, which causes hallucinations and inaccuracies. It proposes A$^3$Tune, a fine-tuning framework that outperforms state-of-the-art baselines on medical VQA and report generation benchmarks.
Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
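The abstract does not state the alignment objective or how visually-critical heads are chosen, so the following is only a toy sketch of the general idea of aligning attention to a weak label. It assumes a binary per-patch mask (a hypothetical stand-in for the SAM and BioMedCLIP derived label), an illustrative loss equal to the negative log of the attention mass inside that mask, and an assumed head-selection rule that keeps the heads whose attention already overlaps the mask most. The names `alignment_loss` and `select_heads` are invented for illustration and are not taken from the paper.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def mass_inside(attention, mask):
    """Total attention weight that falls on patches marked 1 in the mask."""
    return sum(a * m for a, m in zip(attention, mask))


def alignment_loss(attention, mask, eps=1e-8):
    """Illustrative loss (an assumption, not the paper's objective):
    negative log of the attention mass inside the weak-label mask."""
    return -math.log(mass_inside(attention, mask) + eps)


def select_heads(head_attentions, mask, k):
    """Assumed selection rule: keep the k heads with the largest mask overlap."""
    scores = [mass_inside(att, mask) for att in head_attentions]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: 4 image patches, the weak label marks patches 1 and 2.
mask = [0, 1, 1, 0]
heads = [
    normalize([4, 1, 1, 4]),  # mostly attends outside the labeled region
    normalize([1, 4, 4, 1]),  # mostly attends inside the labeled region
    normalize([1, 1, 1, 1]),  # uniform attention
]

chosen = select_heads(heads, mask, k=2)
losses = {h: alignment_loss(heads[h], mask) for h in chosen}
print(chosen, losses)
```

In a real fine-tuning setup such a loss would be computed on the model's attention tensors and backpropagated only through the selected heads, which is one way to read "improve alignment while minimizing interference"; the actual mechanism in A$^3$Tune may differ.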