CV, AI, LG · Nov 11, 2025

Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation

arXiv:2511.08402v1 · 3 citations · h-index: 17
Originality: Highly original
AI Analysis

This paper addresses imaging heterogeneity in medical diagnosis, offering clinicians a novel method for incorporating fine-grained image details, though it improves only incrementally on existing vision-language models.

The paper tackles the challenge of accurate disease interpretation from radiology by developing Anatomy-VLM, a fine-grained vision-language model that localizes anatomical features and integrates structured knowledge, achieving outstanding performance on in- and out-of-distribution datasets and enabling zero-shot anatomy-wise interpretation.

Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integrating subtle image features with clinical knowledge. Yet most vision-language models (VLMs) treat images as holistic entities and overlook the fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by drawing on their prior medical knowledge and identifying anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for context-aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, demonstrating its strong expert-level clinical interpretation capabilities.
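
The abstract does not say how zero-shot anatomy-wise interpretation is scored. As a rough illustration only, the sketch below assumes a CLIP-style setup in which each localized region and each text prompt has already been embedded into a shared space, and every region is assigned the prompt with the highest cosine similarity. The function names, region names, prompt labels, and toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_per_region(region_embeddings, prompt_embeddings):
    """Assign each anatomical region the text prompt it is most similar to.

    Both arguments map a name to an embedding vector. In a real system these
    would come from image and text encoders; here they are supplied directly.
    """
    result = {}
    for region, r_vec in region_embeddings.items():
        scores = {
            label: cosine(r_vec, p_vec)
            for label, p_vec in prompt_embeddings.items()
        }
        result[region] = max(scores, key=scores.get)
    return result


# Toy, hand-written embeddings (assumption: a shared 3-dimensional space).
region_embeddings = {
    "left lung": [0.9, 0.1, 0.0],
    "heart": [0.1, 0.8, 0.2],
}
prompt_embeddings = {
    "opacity": [1.0, 0.0, 0.0],
    "enlarged": [0.0, 1.0, 0.0],
    "normal": [0.0, 0.0, 1.0],
}

predictions = zero_shot_per_region(region_embeddings, prompt_embeddings)
print(predictions)
```

Scoring each region separately, rather than the whole image at once, is what would make the output anatomy-wise: every ROI receives its own label instead of a single image-level prediction.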
