CV | Mar 22

SkinCLIP-VL: Consistency-Aware Vision-Language Learning for Multimodal Skin Cancer Diagnosis

Zhixiang Lu, Shijie Xu, Kaicheng Yan, Xuyue Cai, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Jionglong Su

arXiv:2603.2101049.13 citations | h-index: 1

AI Analysis

This work addresses trustworthy and efficient multimodal diagnosis in dermatology and represents a domain-specific incremental improvement.

The paper tackles high computational cost, data scarcity, and lack of trust in vision-language models for skin cancer diagnosis. It proposes SkinCLIP-VL, a resource-efficient framework that, on the ISIC and Derm7pt benchmarks, reports 4.3-6.2% higher accuracy than 13B-parameter baselines while using 43% fewer parameters.

The deployment of vision-language models (VLMs) in dermatology is hindered by the trilemma of high computational costs, extreme data scarcity, and the black-box nature of deep learning. To address these challenges, we present SkinCLIP-VL, a resource-efficient framework that adapts foundation models for trustworthy skin cancer diagnosis. Adopting a frozen perception, adaptive reasoning paradigm, we integrate a frozen CLIP encoder with a lightweight, quantized Qwen2.5-VL via low-rank adaptation (LoRA). To strictly align visual regions with clinical semantics under long-tailed distributions, we propose the Consistency-aware Focal Alignment (CFA) Loss. This objective synergizes focal re-weighting, semantic alignment, and calibration. On ISIC and Derm7pt benchmarks, SkinCLIP-VL surpasses 13B-parameter baselines by 4.3-6.2% in accuracy with 43% fewer parameters. Crucially, blinded expert evaluation and out-of-distribution testing confirm that our visually grounded rationales significantly enhance clinical trust compared to traditional saliency maps.
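
The abstract names three ingredients of the CFA Loss (focal re-weighting, semantic alignment, and calibration) without giving its formula. The sketch below only illustrates how three such terms could be summed into one objective; it is not the authors' formulation. The specific choices here are assumptions: a standard focal term, a cosine-distance alignment term, a Brier-score calibration term, and the hypothetical weights `w_align` and `w_calib`.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_term(probs, target, gamma=2.0):
    # Standard focal loss: down-weights examples the model already gets right.
    # With gamma = 0 this reduces to ordinary cross-entropy.
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def alignment_term(image_vec, text_vec):
    # Assumed alignment penalty: cosine distance between an image embedding
    # and the text embedding of its clinical description.
    dot = sum(a * b for a, b in zip(image_vec, text_vec))
    norm_i = math.sqrt(sum(a * a for a in image_vec))
    norm_t = math.sqrt(sum(b * b for b in text_vec))
    return 1.0 - dot / (norm_i * norm_t)

def calibration_term(probs, target):
    # Assumed calibration penalty: Brier score against the one-hot label.
    return sum((p - (1.0 if i == target else 0.0)) ** 2
               for i, p in enumerate(probs))

def combined_loss(logits, target, image_vec, text_vec,
                  gamma=2.0, w_align=1.0, w_calib=1.0):
    # Hypothetical weighted sum of the three terms; the weights are
    # illustrative and not taken from the paper.
    probs = softmax(logits)
    return (focal_term(probs, target, gamma)
            + w_align * alignment_term(image_vec, text_vec)
            + w_calib * calibration_term(probs, target))
```

In this sketch, a confident correct prediction with well-aligned embeddings yields a lower value than a confident wrong prediction with orthogonal embeddings, which is the qualitative behavior one would expect from an objective of this kind.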
