CV LG Aug 27, 2025

Grounding Multimodal Large Language Models with Quantitative Skin Attributes: A Retrieval Study

Max Torop, Masih Eskandar, Nicholas Kurtansky, Jinyang Liu, Jochen Weber, Octavia Camps, Veronica Rotemberg, Jennifer Dy, Kivanc Kose

arXiv:2508.20188v1 3.6 h-index: 12

Originality: Synthesis-oriented

AI Analysis

This work addresses interpretability for clinicians using AI in dermatology, but it is incremental: it builds on existing MLLM and attribute-based approaches without demonstrating broad clinical impact.

The study addresses interpretability in AI models for skin disease diagnosis by asking whether Multimodal Large Language Models (MLLMs) can be grounded in quantitative skin attributes such as lesion area. Through a retrieval case study on the SLICE-3D dataset, it finds evidence that MLLM embedding spaces can be aligned with these attributes via fine-tuning.

Artificial Intelligence models have demonstrated significant success in diagnosing skin diseases, including cancer, showing the potential to assist clinicians in their analysis. However, the interpretability of model predictions must be significantly improved before they can be used in practice. To this end, we explore the combination of two promising approaches: Multimodal Large Language Models (MLLMs) and quantitative attribute usage. MLLMs offer a potential avenue for increased interpretability, providing reasoning for diagnosis in natural language through an interactive format. Separately, a number of quantitative attributes that are related to lesion appearance (e.g., lesion area) have recently been found predictive of malignancy with high accuracy. Predictions grounded as a function of such concepts have the potential for improved interpretability. We provide evidence that MLLM embedding spaces can be grounded in such attributes, through fine-tuning to predict their values from images. Concretely, we evaluate this grounding in the embedding space through an attribute-specific content-based image retrieval case study using the SLICE-3D dataset.
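
To make the retrieval evaluation described in the abstract concrete, the following is a minimal sketch of attribute-specific content-based image retrieval. It is not the authors' implementation. It assumes embeddings are compared by cosine similarity and that grounding is scored by how close the retrieved images' attribute values are to the query's. The function names, the synthetic "lesion area" values, and the 2-D toy embeddings are all hypothetical stand-ins for MLLM embeddings and SLICE-3D attributes.

```python
import numpy as np


def retrieve_top_k(query_emb, gallery_embs, k=5):
    """Return indices of the k gallery embeddings most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q  # cosine similarity of each gallery item to the query
    return np.argsort(-sims)[:k]


def mean_attribute_gap(query_value, gallery_values, retrieved_idx):
    """Mean absolute difference between the query attribute and the retrieved attributes."""
    return float(np.mean(np.abs(gallery_values[retrieved_idx] - query_value)))


rng = np.random.default_rng(0)

# Hypothetical attribute values (e.g. lesion area) for 200 gallery images.
gallery_values = rng.uniform(1.0, 50.0, size=200)
query_value = 25.0


def toy_grounded_embedding(values):
    """Toy 'grounded' embedding: the attribute is encoded as an angle on the unit circle,
    so cosine similarity decreases as the attribute difference grows."""
    theta = np.asarray(values, dtype=float) / 50.0 * (np.pi / 2)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)


grounded_gallery = toy_grounded_embedding(gallery_values)
grounded_query = toy_grounded_embedding(query_value)

# Baseline: embeddings that carry no attribute information at all.
random_gallery = rng.normal(size=(200, 2))
random_query = rng.normal(size=2)

grounded_idx = retrieve_top_k(grounded_query, grounded_gallery, k=5)
random_idx = retrieve_top_k(random_query, random_gallery, k=5)

grounded_gap = mean_attribute_gap(query_value, gallery_values, grounded_idx)
random_gap = mean_attribute_gap(query_value, gallery_values, random_idx)

print(f"grounded embedding gap: {grounded_gap:.3f}")
print(f"random embedding gap:   {random_gap:.3f}")
```

Under these assumptions, a smaller attribute gap for the grounded embeddings than for the random baseline is the kind of signal such a retrieval study looks for. The metric actually used in the paper may differ.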
