IV, AI, CV
May 20, 2025

MedBLIP: Fine-tuning BLIP for Medical Image Captioning

arXiv:2505.14726v1, 7 citations, h-index: 4
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for clinically precise radiology image descriptions, but it is incremental as it applies existing fine-tuning techniques to a specialized domain.

The study fine-tunes the BLIP model on the ROCO dataset to generate more accurate medical image captions, and reports significant performance improvements. Decoder-only fine-tuning cuts training time by 5% relative to full fine-tuning, while full fine-tuning achieves the best results.

Medical image captioning is a challenging task that requires generating clinically accurate and semantically meaningful descriptions of radiology images. While recent vision-language models (VLMs) such as BLIP, BLIP-2, Gemini and ViT-GPT2 show strong performance on natural image datasets, they often produce generic or imprecise captions when applied to specialized medical domains. In this project, we explore the effectiveness of fine-tuning the BLIP model on the ROCO dataset for improved radiology captioning. We compare the fine-tuned BLIP against its zero-shot version, BLIP-2 base, BLIP-2 Instruct and a ViT-GPT2 transformer baseline. Our results demonstrate that domain-specific fine-tuning of BLIP significantly improves performance across both quantitative and qualitative evaluation metrics. We also visualize decoder cross-attention maps to assess interpretability and conduct an ablation study to evaluate the contributions of encoder-only and decoder-only fine-tuning. Our findings highlight the importance of targeted adaptation for medical applications and suggest that decoder-only fine-tuning (with the encoder frozen) offers a strong performance baseline with 5% lower training time than full fine-tuning, while full model fine-tuning still yields the best results overall.
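The ablation described in the abstract (full, encoder-only, and decoder-only fine-tuning) comes down to choosing which parameters receive gradients. The sketch below is not the authors' code; the abstract does not describe their implementation. It is a minimal illustration that assumes the image encoder's parameter names start with `vision_model.` and the text decoder's with `text_decoder.`, as in Hugging Face's `BlipForConditionalGeneration`. The helper names `trainable_flags` and `apply_flags` are hypothetical.

```python
from typing import Dict, Iterable

# Assumed parameter-name prefixes (check them against your own model's
# named_parameters() output before relying on this).
ENCODER_PREFIX = "vision_model."
DECODER_PREFIX = "text_decoder."


def trainable_flags(param_names: Iterable[str], mode: str) -> Dict[str, bool]:
    """Map each parameter name to True if it should be trained in `mode`."""
    if mode not in {"full", "decoder_only", "encoder_only"}:
        raise ValueError(f"unknown mode: {mode!r}")
    flags = {}
    for name in param_names:
        if mode == "full":
            flags[name] = True
        elif mode == "decoder_only":
            # Encoder frozen: only text-decoder weights are updated.
            flags[name] = name.startswith(DECODER_PREFIX)
        else:
            # Decoder frozen: only image-encoder weights are updated.
            flags[name] = name.startswith(ENCODER_PREFIX)
    return flags


def apply_flags(model, mode: str) -> None:
    """Set requires_grad on any object exposing named_parameters(),
    for example a torch.nn.Module."""
    named = list(model.named_parameters())
    flags = trainable_flags((name for name, _ in named), mode)
    for name, param in named:
        param.requires_grad = flags[name]
```

With a PyTorch model, calling `apply_flags(model, "decoder_only")` before building the optimizer would leave the encoder frozen. Note that in the two partial modes, any parameter matching neither prefix is frozen as well.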

