IV, AI, CV | Oct 29, 2025

Transformers in Medicine: Improving Vision-Language Alignment for Medical Image Captioning

Yogesh Thakku Suresh, Vishwajeet Shivaji Hogale, Luca-Alexandru Zamfira, Anandavardhana Hegde

arXiv:2510.25164v2

Originality: Synthesis-oriented

AI Analysis

This work addresses automated medical image reporting for clinicians. It is incremental: it builds on existing transformer and LSTM methods and adds domain-specific tuning.

The authors tackle medical image captioning for MRI scans with a transformer-based multimodal framework that combines DEiT-Small, MediCareBERT, and an LSTM decoder. Focusing on domain-specific, brain-only MRIs improves caption accuracy and semantic alignment relative to general MRI images.

We present a transformer-based multimodal framework for generating clinically relevant captions for MRI scans. Our system combines a DEiT-Small vision transformer as an image encoder, MediCareBERT for caption embedding, and a custom LSTM-based decoder. The architecture is designed to semantically align image and textual embeddings, using hybrid cosine-MSE loss and contrastive inference via vector similarity. We benchmark our method on the MultiCaRe dataset, comparing performance on filtered brain-only MRIs versus general MRI images against state-of-the-art medical image captioning methods including BLIP, R2GenGPT, and recent transformer-based approaches. Results show that focusing on domain-specific data improves caption accuracy and semantic alignment. Our work proposes a scalable, interpretable solution for automated medical image reporting.
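The abstract names a hybrid cosine-MSE loss and inference via vector similarity but does not give their exact form. The following is a minimal sketch under stated assumptions, not the authors' implementation: the weighting parameter `alpha`, the function names, and the toy embeddings are all hypothetical, and the two loss terms are assumed to be combined as a simple weighted sum.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def hybrid_cosine_mse_loss(image_emb, caption_emb, alpha=0.5):
    """Assumed form: alpha * (1 - cosine) + (1 - alpha) * MSE.

    The cosine term penalizes directional mismatch and the MSE term
    penalizes magnitude mismatch. `alpha` is a hypothetical weight;
    the abstract does not state how the terms are balanced.
    """
    cosine_term = 1.0 - cosine_similarity(image_emb, caption_emb)
    return alpha * cosine_term + (1.0 - alpha) * mse(image_emb, caption_emb)


def retrieve_caption(image_emb, candidates):
    """Similarity-based inference: return the caption whose embedding
    is closest (by cosine similarity) to the image embedding.

    `candidates` is a list of (caption_text, caption_embedding) pairs.
    """
    best = max(candidates, key=lambda c: cosine_similarity(image_emb, c[1]))
    return best[0]


# Toy usage with made-up 2-D embeddings (real embeddings would come
# from the image encoder and the caption embedding model).
image = [1.0, 0.0]
candidates = [
    ("caption A", [0.9, 0.1]),
    ("caption B", [0.0, 1.0]),
]
loss = hybrid_cosine_mse_loss(image, candidates[0][1])
chosen = retrieve_caption(image, candidates)
```

In this sketch the loss falls to zero only when the two embeddings coincide, while retrieval depends on direction alone, so a caption embedding can be selected at inference even if its magnitude differs from the image embedding.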
