CV | Apr 29, 2025

MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation

arXiv:2504.20343v1 | 3 citations | h-index: 3 | Has Code | IJCNN
Originality: Incremental advance
AI Analysis

This work addresses medical image reporting for radiology and clinical applications, offering improvements in accuracy and alignment, but it appears incremental as it builds on existing vision-language and mixture-of-experts approaches.

The paper tackles the problem of generating clinical descriptions from medical images by addressing fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, and reports state-of-the-art results on the COVCTR, MMR, PGROSS, and ROCO datasets.

Medical image reporting (MIR) aims to generate structured clinical descriptions from radiological images. Existing methods struggle with fine-grained feature extraction, multimodal alignment, and generalization across diverse imaging types, often relying on vanilla transformers and focusing primarily on chest X-rays. We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion, designed to address these limitations. Our architecture includes: (i) a multiscale vision encoder (MSVE) for capturing anatomical details at varying resolutions, (ii) a multihead dual-branch latent attention (MDLA) module for vision-language alignment through latent bottleneck representations, and (iii) a modulated mixture-of-experts (MoE) decoder for adaptive expert specialization. We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results on COVCTR, MMR, PGROSS, and ROCO datasets. Extensive experiments and ablations confirm improved clinical accuracy, cross-modal alignment, and model interpretability. Code is available at https://github.com/AI-14/micar-vl-moe.
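To make the mixture-of-experts idea in the abstract concrete, here is a minimal sketch of a generic softmax-gated, top-k mixture-of-experts layer in plain Python. It is not the paper's modulated MoE decoder: the linear gate, the top-k routing, the toy experts, and every name (`moe_forward`, `gate_weights`, `top_k`) are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    gate_weights: List[Vector],
    experts: List[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Route x through the top_k experts picked by a linear gate.

    Hypothetical helper: assumes every expert returns a vector with the
    same length as x, and that gate_weights has one row per expert.
    """
    # One gate logit per expert: dot product of x with that expert's gate row.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts (ties keep index order).
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the gate over the selected experts only.
    probs = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for p, i in zip(probs, chosen):
        y = experts[i](x)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out


# Toy experts standing in for learned feed-forward blocks (assumed, not from the paper).
experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 1.0]
y = moe_forward(x, gate_weights, experts, top_k=2)
```

In this toy setup the gate scores the first two experts equally and the third lowest, so the output is an even blend of the first two experts. A trained model would learn the gate and expert weights; consult the linked repository for the authors' actual design.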

Code Implementations: 1 repo
