CV, May 22, 2025

Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification

arXiv:2505.16338v1, h-index: 28
Originality: Synthesis-oriented
AI Analysis

This work addresses skin cancer diagnosis by evaluating foundation models for dermatology, but it is incremental as it builds on existing models and fusion techniques.

The study tackled skin lesion classification from dermatoscopic images by comparing a dermatology-specific foundation model (PanDerm) with Vision Transformer architectures, finding that a PanDerm-based MLP model performed comparably to a fine-tuned Swin Transformer and that fusing their predictions improved performance.

Accurate classification of skin lesions from dermatoscopic images is essential for diagnosis and treatment of skin cancer. In this study, we investigate the utility of a dermatology-specific foundation model, PanDerm, in comparison with two Vision Transformer (ViT) architectures (ViT base and Swin Transformer V2 base) for the task of skin lesion classification. Using frozen features extracted from PanDerm, we apply non-linear probing with three different classifiers, namely, multi-layer perceptron (MLP), XGBoost, and TabNet. For the ViT-based models, we perform full fine-tuning to optimize classification performance. Our experiments on the HAM10000 and MSKCC datasets demonstrate that the PanDerm-based MLP model performs comparably to the fine-tuned Swin Transformer model, while fusion of PanDerm and Swin Transformer predictions leads to further performance improvements. Future work will explore additional foundation models, fine-tuning strategies, and advanced fusion techniques.
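
The abstract says that fusing PanDerm and Swin Transformer predictions improves performance, but it does not state how the fusion is done. The sketch below illustrates one common option, late fusion by weighted averaging of per-class probabilities, as an assumption rather than the authors' method. The function names, the `weight_a` parameter, and the three-class toy probabilities are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_probabilities(
    probs_a: Sequence[Sequence[float]],
    probs_b: Sequence[Sequence[float]],
    weight_a: float = 0.5,
) -> List[List[float]]:
    """Weighted average of two models' per-class probabilities.

    Assumption: both inputs are already normalized (e.g. softmax outputs),
    have one row per image, and use the same class order.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same number of images")

    fused = []
    for row_a, row_b in zip(probs_a, probs_b):
        if len(row_a) != len(row_b):
            raise ValueError("both models must score the same classes")
        # Convex combination keeps each fused row summing to 1.
        fused.append(
            [weight_a * a + (1.0 - weight_a) * b for a, b in zip(row_a, row_b)]
        )
    return fused


def predict_labels(probs: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest-probability class for each image."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical toy outputs for two images and three classes.
panderm_mlp_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
swin_probs = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

fused_probs = fuse_probabilities(panderm_mlp_probs, swin_probs, weight_a=0.5)
fused_labels = predict_labels(fused_probs)
```

In practice the fusion weight would be chosen on a validation split rather than fixed at 0.5. Other schemes, such as stacking a classifier on the concatenated outputs, are equally compatible with the abstract's wording.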
