CV · AI · Mar 10

Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee

arXiv:2603.09108v1 · 4.5 · h-index: 2

Predicted impact: top 85% in CV · last 90 days · Originality: Highly original

AI Analysis

This work addresses the retrieval of relevant medical cases for skin cancer diagnosis. The contribution is incremental: it builds on existing retrieval methods and adds a novel alignment approach.

The paper tackles composed vision-language retrieval for skin cancer by proposing a transformer-based framework that jointly aligns global and local representations, achieving consistent improvements over state-of-the-art methods on the Derm7pt dataset.

Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision-making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image-text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer-based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
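The convex global-local weighting described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the softmax mask pooling, the mean over per-mask cosine similarities, the function names, and the default `local_weight=0.7` are all assumptions chosen only to show the idea of a similarity of the form w * s_local + (1 - w) * s_global with w in [0, 1].

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_pool(features, mask_logits):
    """Pool N patch features into K local descriptors (hypothetical helper).

    features:    (N, D) patch-level features
    mask_logits: (K, N) unnormalized scores, one row per spatial attention mask
    returns:     (K, D) attention-weighted averages of the patch features
    """
    # Softmax over spatial positions so each mask sums to 1.
    z = mask_logits - mask_logits.max(axis=1, keepdims=True)
    masks = np.exp(z)
    masks /= masks.sum(axis=1, keepdims=True)
    return masks @ features


def composed_similarity(q_global, q_local, c_global, c_local, local_weight=0.7):
    """Convex combination of local and global cosine similarity.

    local_weight is an assumed illustrative value, not taken from the paper.
    """
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1] for a convex combination")
    # Global term: cosine similarity between the two holistic embeddings.
    s_global = float(l2_normalize(q_global) @ l2_normalize(c_global))
    # Local term: cosine similarity per mask, averaged over the K masks.
    per_mask = np.sum(l2_normalize(q_local) * l2_normalize(c_local), axis=1)
    s_local = float(per_mask.mean())
    return local_weight * s_local + (1.0 - local_weight) * s_global
```

To rank a database under these assumptions, one would compute `composed_similarity` between the query and every candidate and sort candidates by descending score. A weight above 0.5 favors local evidence, which matches the abstract's stated emphasis, but the actual value and how it is chosen are not specified there.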
