Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
This work addresses the generation of accurate and interpretable diagnostic narratives from radiological images for clinical reporting workflows. It is a domain-specific, incremental improvement.
The paper tackles automated medical image captioning by proposing a Swin-BART encoder-decoder system with a regional attention module to enhance diagnostically salient regions, achieving state-of-the-art semantic fidelity with ROUGE of 0.603 and BERTScore of 0.807 on the ROCO dataset.
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95\%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, \texttt{no\_repeat\_ngram\_size} $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
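The reporting convention of mean$\pm$std over three seeds with a $95\%$ confidence interval can be illustrated with a minimal sketch. The abstract does not say how the intervals are computed, so the Student-t interval used here is an assumption. The function name `summarize_seeds` and the per-seed scores are hypothetical placeholders, not values from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom
# (three seeds give n - 1 = 2). Assumption: a t-interval is used.
T_CRIT_95_DF2 = 4.303


def summarize_seeds(scores):
    """Return (mean, sample std, CI lower, CI upper) for three per-seed scores."""
    if len(scores) != 3:
        raise ValueError("this sketch hard-codes the t value for three seeds")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    half_width = T_CRIT_95_DF2 * std / math.sqrt(len(scores))
    return mean, std, mean - half_width, mean + half_width


# Hypothetical placeholder scores for illustration only, NOT results from the paper.
example_scores = [0.60, 0.61, 0.59]
mean, std, lo, hi = summarize_seeds(example_scores)
print(f"{mean:.3f} +/- {std:.3f} (95% CI [{lo:.3f}, {hi:.3f}])")
```

With only three seeds the critical value is large, so the interval is several times wider than the standard error alone would suggest. This is one reason to read three-seed intervals alongside the paired significance tests rather than in isolation.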