CV | LG | IV | Dec 23, 2024

GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning

arXiv:2412.17251v1 | 3 citations | h-index: 14 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the problem of automated retinal image captioning for medical diagnosis, but it appears incremental as it builds on existing Transformer-based models with a novel attention mechanism.

The paper tackles the challenge of generating accurate medical reports from retinal images under limited labeled data. It proposes a vision-language model with guided context self-attention and reports a 0.023 BLEU@4 improvement on the DeepEyeNet dataset.

Retinal image analysis is crucial for diagnosing and treating eye diseases, yet generating accurate medical reports from images remains challenging due to variability in image quality and pathology, especially with limited labeled data. Previous Transformer-based models struggled to integrate visual and textual information under limited supervision. In response, we propose a novel vision-language model for retinal image captioning that combines visual and textual features through a guided context self-attention mechanism. This approach captures both intricate details and the global clinical context, even in data-scarce scenarios. Extensive experiments on the DeepEyeNet dataset demonstrate a 0.023 BLEU@4 improvement, along with significant qualitative advancements, highlighting the effectiveness of our model in generating comprehensive medical captions.
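To make the reported metric concrete, here is a minimal sketch of how a BLEU@4 score can be computed for a single caption. It is not the paper's evaluation code. It assumes whitespace tokenization, one reference per caption, and no smoothing, whereas published results are usually corpus-level and may use a different tokenizer. The function names `ngrams` and `bleu4` and the example captions are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU@4 against one reference, without smoothing.

    Simplifying assumptions (not taken from the paper): whitespace
    tokenization, a single reference, uniform weights over n = 1..4.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Modified precision: clip each n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the four modified precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)


# Hypothetical captions, invented for illustration only.
reference = "fundus image shows diabetic retinopathy with macular edema"
candidate = "fundus image shows diabetic retinopathy with edema"
score = bleu4(candidate, reference)
print(round(score, 3))
```

Because the score lies between 0 and 1, a 0.023 gain is an absolute difference on that scale. Its practical significance depends on the baseline score it is measured against.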

