CV, Mar 22

SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

arXiv:2603.2101049.13 citations, h-index: 1
AI Analysis

This work addresses the problem of trustworthy and efficient multimodal diagnosis for dermatology, representing a domain-specific incremental improvement.

The paper tackled the challenges of high computational costs, data scarcity, and lack of trust in vision-language models for skin cancer diagnosis by proposing SkinCLIP-VL, a resource-efficient framework that achieved 4.3-6.2% higher accuracy than 13B-parameter baselines with 43% fewer parameters on ISIC and Derm7pt benchmarks.

The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
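The abstract names three ingredients of the CFA Loss (focal re-weighting, semantic alignment, and calibration) but does not give its formula. The following is a minimal sketch of how such terms could be combined, not the authors' objective. The focal term follows the standard focal-loss form; the cosine alignment term, the Brier-score calibration term, the weighted sum, and every name and default (`cfa_like_loss`, `w_align`, `w_cal`, `gamma`) are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_term(probs, target, gamma=2.0):
    """Standard focal loss: cross-entropy scaled by (1 - p_t)^gamma,
    so easy, confidently correct examples contribute less."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def alignment_term(image_emb, text_emb):
    """Assumed alignment penalty: 1 - cosine similarity between an
    image embedding and the embedding of its clinical description."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    return 1.0 - dot / (norm_i * norm_t)

def calibration_term(probs, target):
    """Assumed calibration penalty: multi-class Brier score."""
    return sum((p - (1.0 if i == target else 0.0)) ** 2
               for i, p in enumerate(probs))

def cfa_like_loss(logits, target, image_emb, text_emb,
                  gamma=2.0, w_align=1.0, w_cal=1.0):
    """Hypothetical weighted sum of the three terms for one example."""
    probs = softmax(logits)
    return (focal_term(probs, target, gamma)
            + w_align * alignment_term(image_emb, text_emb)
            + w_cal * calibration_term(probs, target))

# Usage: one example with three classes and 2-D toy embeddings.
loss = cfa_like_loss([2.0, 0.0, 0.0], target=0,
                     image_emb=[1.0, 0.0], text_emb=[1.0, 0.0])
```

With `gamma=0` the focal term reduces to plain cross-entropy, which is one way to see what the re-weighting contributes on long-tailed data: rare, hard classes keep most of their loss while common, easy ones are suppressed.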

