CVApr 4, 2023Code
EPVT: Environment-aware Prompt Vision Transformer for Domain Generalization in Skin Lesion RecognitionSiyuan Yan, Chi Liu, Zhen Yu et al.
Skin lesion recognition using deep learning has made remarkable progress, and there is an increasing need for deploying these systems in real-world scenarios. However, recent research has revealed that deep neural networks for skin lesion recognition may overly depend on disease-irrelevant image artifacts (i.e., dark corners, dense hairs), leading to poor generalization in unseen environments. To address this issue, we propose a novel domain generalization method called EPVT, which involves embedding prompts into the vision transformer to collaboratively learn knowledge from diverse domains. Concretely, EPVT leverages a set of domain prompts, each of which plays as a domain expert, to capture domain-specific knowledge; and a shared prompt for general knowledge over the entire dataset. To facilitate knowledge sharing and the interaction of different prompts, we introduce a domain prompt generator that enables low-rank multiplicative updates between domain prompts and the shared prompt. A domain mixup strategy is additionally devised to reduce the co-occurring artifacts in each domain, which allows for more flexible decision margins and mitigates the issue of incorrectly assigned domain labels. Experiments on four out-of-distribution datasets and six different biased ISIC datasets demonstrate the superior generalization ability of EPVT in skin lesion recognition across various environments. Code is avaliable at https://github.com/SiyuanYan1/EPVT.
CVApr 8, 2023Code
Universal Semi-Supervised Learning for Medical Image ClassificationLie Ju, Yicheng Wu, Wei Feng et al.
Semi-supervised learning (SSL) has attracted much attention since it reduces the expensive costs of collecting adequate well-labeled training data, especially for deep learning methods. However, traditional SSL is built upon an assumption that labeled and unlabeled data should be from the same distribution \textit{e.g.,} classes and domains. However, in practical scenarios, unlabeled data would be from unseen classes or unseen domains, and it is still challenging to exploit them by existing SSL methods. Therefore, in this paper, we proposed a unified framework to leverage these unseen unlabeled data for open-scenario semi-supervised medical image classification. We first design a novel scoring mechanism, called dual-path outliers estimation, to identify samples from unseen classes. Meanwhile, to extract unseen-domain samples, we then apply an effective variational autoencoder (VAE) pre-training. After that, we conduct domain adaptation to fully exploit the value of the detected unseen-domain samples to boost semi-supervised training. We evaluated our proposed framework on dermatology and ophthalmology tasks. Extensive experiments demonstrate our model can achieve superior classification performance in various medical SSL scenarios. The code implementations are accessible at: https://github.com/PyJulie/USSL4MIC.
CVOct 20, 2023Code
NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity UnderstandingMing Hu, Lin Wang, Siyuan Yan et al.
The application of deep learning to nursing procedure activity understanding has the potential to greatly enhance the quality and safety of nurse-patient interactions. By utilizing the technique, we can facilitate training and education, improve quality control, and enable operational compliance monitoring. However, the development of automatic recognition systems in this field is currently hindered by the scarcity of appropriately labeled datasets. The existing video datasets pose several limitations: 1) these datasets are small-scale in size to support comprehensive investigations of nursing activity; 2) they primarily focus on single procedures, lacking expert-level annotations for various nursing procedures and action steps; and 3) they lack temporally localized annotations, which prevents the effective localization of targeted actions within longer video sequences. To mitigate these limitations, we propose NurViD, a large video dataset with expert-level annotation for nursing procedure activity understanding. NurViD consists of over 1.5k videos totaling 144 hours, making it approximately four times longer than the existing largest nursing activity datasets. Notably, it encompasses 51 distinct nursing procedures and 177 action steps, providing a much more comprehensive coverage compared to existing datasets that primarily focus on limited procedures. To evaluate the efficacy of current deep learning methods on nursing activity understanding, we establish three benchmarks on NurViD: procedure recognition on untrimmed videos, procedure and action recognition on trimmed videos, and action detection. Our benchmark and code will be available at \url{https://github.com/minghu0830/NurViD-benchmark}.
CVNov 23, 2023Code
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical UnderstandingPeng Xia, Xingtong Yu, Ming Hu et al.
Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (HGCLIP) that effectively combines CLIP with a deeper exploitation of the Hierarchical class structure via Graph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.
CVSep 28, 2023
Towards Novel Class Discovery: A Study in Novel Skin Lesions ClusteringWei Feng, Lie Ju, Lin Wang et al. · ibm-research
Existing deep learning models have achieved promising performance in recognizing skin diseases from dermoscopic images. However, these models can only recognize samples from predefined categories, when they are deployed in the clinic, data from new unknown categories are constantly emerging. Therefore, it is crucial to automatically discover and identify new semantic categories from new data. In this paper, we propose a new novel class discovery framework for automatically discovering new semantic classes from dermoscopy image datasets based on the knowledge of known classes. Specifically, we first use contrastive learning to learn a robust and unbiased feature representation based on all data from known and unknown categories. We then propose an uncertainty-aware multi-view cross pseudo-supervision strategy, which is trained jointly on all categories of data using pseudo labels generated by a self-labeling strategy. Finally, we further refine the pseudo label by aggregating neighborhood information through local sample similarity to improve the clustering performance of the model for unknown categories. We conducted extensive experiments on the dermatology dataset ISIC 2019, and the experimental results show that our approach can effectively leverage knowledge from known categories to discover new semantic categories. We also further validated the effectiveness of the different modules through extensive ablation experiments. Our code will be released soon.
CVSep 13, 2022
Skin Lesion Recognition with Class-Hierarchy Regularized Hyperbolic EmbeddingsZhen Yu, Toan Nguyen, Yaniv Gal et al.
In practice, many medical datasets have an underlying taxonomy defined over the disease label space. However, existing classification algorithms for medical diagnoses often assume semantically independent labels. In this study, we aim to leverage class hierarchy with deep learning algorithms for more accurate and reliable skin lesion recognition. We propose a hyperbolic network to learn image embeddings and class prototypes jointly. The hyperbola provably provides a space for modeling hierarchical relations better than Euclidean geometry. Meanwhile, we restrict the distribution of hyperbolic prototypes with a distance matrix that is encoded from the class hierarchy. Accordingly, the learned prototypes preserve the semantic class relations in the embedding space and we can predict the label of an image by assigning its feature to the nearest hyperbolic class prototype. We use an in-house skin lesion dataset which consists of around 230k dermoscopic images on 65 skin diseases to verify our method. Extensive experiments provide evidence that our model can achieve higher accuracy with less severe classification errors than models without considering class relations.
CVJul 8, 2022
Unsupervised Domain Adaptive Fundus Image Segmentation with Category-level RegularizationWei Feng, Lin Wang, Lie Ju et al.
Existing unsupervised domain adaptation methods based on adversarial learning have achieved good performance in several medical imaging tasks. However, these methods focus only on global distribution adaptation and ignore distribution constraints at the category level, which would lead to sub-optimal adaptation performance. This paper presents an unsupervised domain adaptation framework based on category-level regularization that regularizes the category distribution from three perspectives. Specifically, for inter-domain category regularization, an adaptive prototype alignment module is proposed to align feature prototypes of the same category in the source and target domains. In addition, for intra-domain category regularization, we tailored a regularization technique for the source and target domains, respectively. In the source domain, a prototype-guided discriminative loss is proposed to learn more discriminative feature representations by enforcing intra-class compactness and inter-class separability, and as a complement to traditional supervised loss. In the target domain, an augmented consistency category regularization loss is proposed to force the model to produce consistent predictions for augmented/unaugmented target images, which encourages semantically similar regions to be given the same label. Extensive experiments on two publicly fundus datasets show that the proposed approach significantly outperforms other state-of-the-art comparison algorithms.
CVApr 7, 2022
Flexible Sampling for Long-tailed Skin Lesion ClassificationLie Ju, Yicheng Wu, Lin Wang et al.
Most of the medical tasks naturally exhibit a long-tailed distribution due to the complex patient-level conditions and the existence of rare diseases. Existing long-tailed learning methods usually treat each class equally to re-balance the long-tailed distribution. However, considering that some challenging classes may present diverse intra-class distributions, re-balancing all classes equally may lead to a significant performance drop. To address this, in this paper, we propose a curriculum learning-based framework called Flexible Sampling for the long-tailed skin lesion classification task. Specifically, we initially sample a subset of training data as anchor points based on the individual class prototypes. Then, these anchor points are used to pre-train an inference model to evaluate the per-class learning difficulty. Finally, we use a curriculum sampling module to dynamically query new samples from the rest training samples with the learning difficulty-aware sampling probability. We evaluated our model against several state-of-the-art methods on the ISIC dataset. The results with two long-tailed settings have demonstrated the superiority of our proposed training strategy, which achieves a new benchmark for long-tailed skin lesion classification.
74.7CVMay 14Code
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-MakingYize Liu, Siyuan Yan, Ming Hu et al.
Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.
IVOct 11, 2022
3D Matting: A Benchmark Study on Soft Segmentation Method for Pulmonary Nodules Applied in Computed TomographyLin Wang, Xiufen Ye, Donghao Zhang et al.
Usually, lesions are not isolated but are associated with the surrounding tissues. For example, the growth of a tumour can depend on or infiltrate into the surrounding tissues. Due to the pathological nature of the lesions, it is challenging to distinguish their boundaries in medical imaging. However, these uncertain regions may contain diagnostic information. Therefore, the simple binarization of lesions by traditional binary segmentation can result in the loss of diagnostic information. In this work, we introduce the image matting into the 3D scenes and use the alpha matte, i.e., a soft mask, to describe lesions in a 3D medical image. The traditional soft mask acted as a training trick to compensate for the easily mislabelled or under-labelled ambiguous regions. In contrast, 3D matting uses soft segmentation to characterize the uncertain regions more finely, which means that it retains more structural information for subsequent diagnosis and treatment. The current study of image matting methods in 3D is limited. To address this issue, we conduct a comprehensive study of 3D matting, including both traditional and deep-learning-based methods. We adapt four state-of-the-art 2D image matting algorithms to 3D scenes and further customize the methods for CT images to calibrate the alpha matte with the radiodensity. Moreover, we propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark. Its efficient counterparts are also proposed to achieve a good performance-computation balance. Furthermore, there is no high-quality annotated dataset related to 3D matting, slowing down the development of data-driven deep-learning-based methods. To address this issue, we construct the first 3D medical matting dataset. The validity of the dataset was verified through clinicians' assessments and downstream experiments.
CVFeb 11
A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in DermatologySiyuan Yan, Xieji Li, Dan Mo et al.
Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
CVDec 10, 2025Code
Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and OutlookYuan Ma, Junlin Hou, Chao Zhang et al.
Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses \textbf{10} representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.The codebase is publicly available on https://github.com/myyy777/LNMBench.
IVSep 16, 2022
3D Matting: A Soft Segmentation Method Applied in Computed TomographyLin Wang, Xiufen Ye, Donghao Zhang et al.
Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis. Semantic ambiguity is a typical feature of many medical image labels. It can be caused by many factors, such as the imaging properties, pathological anatomy, and the weak representation of the binary masks, which brings challenges to accurate 3D segmentation. In 2D medical images, using soft masks instead of binary masks generated by image matting to characterize lesions can provide rich semantic information, describe the structural characteristics of lesions more comprehensively, and thus benefit the subsequent diagnoses and analyses. In this work, we introduce image matting into the 3D scenes to describe the lesions in 3D medical images. The study of image matting in 3D modality is limited, and there is no high-quality annotated dataset related to 3D matting, therefore slowing down the development of data-driven deep-learning-based methods. To address this issue, we constructed the first 3D medical matting dataset and convincingly verified the validity of the dataset through quality control and downstream experiments in lung nodules classification. We then adapt the four selected state-of-the-art 2D image matting algorithms to 3D scenes and further customize the methods for CT images. Also, we propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark, which will be released to encourage further research.
CVDec 16, 2025
Native Intelligence Emerges from Large-Scale Clinical Practice: A Retinal Foundation Model with Deployment EfficiencyJia Guo, Jiawei Du, Shengzhu Yang et al.
Current retinal foundation models remain constrained by curated research datasets that lack authentic clinical context, and require extensive task-specific optimization for each application, limiting their deployment efficiency in low-resource settings. Here, we show that these barriers can be overcome by building clinical native intelligence directly from real-world medical practice. Our key insight is that large-scale telemedicine programs, where expert centers provide remote consultations across distributed facilities, represent a natural reservoir for learning clinical image interpretation. We present ReVision, a retinal foundation model that learns from the natural alignment between 485,980 color fundus photographs and their corresponding diagnostic reports, accumulated through a decade-long telemedicine program spanning 162 medical institutions across China. Through extensive evaluation across 27 ophthalmic benchmarks, we demonstrate that ReVison enables deployment efficiency with minimal local resources. Without any task-specific training, ReVision achieves zero-shot disease detection with an average AUROC of 0.946 across 12 public benchmarks and 0.952 on 3 independent clinical cohorts. When minimal adaptation is feasible, ReVision matches extensively fine-tuned alternatives while requiring orders of magnitude fewer trainable parameters and labeled examples. The learned representations also transfer effectively to new clinical sites, imaging domains, imaging modalities, and systemic health prediction tasks. In a prospective reader study with 33 ophthalmologists, ReVision's zero-shot assistance improved diagnostic accuracy by 14.8% across all experience levels. These results demonstrate that clinical native intelligence can be directly extracted from clinical archives without any further annotation to build medical AI systems suited to various low-resource settings.
LGJan 16
oculomix: Hierarchical Sampling for Retinal-Based Systemic Disease PredictionHyunmin Kim, Yukun Zhou, Rahul A. Jonas et al.
Oculomics - the concept of predicting systemic diseases, such as cardiovascular disease and dementia, through retinal imaging - has advanced rapidly due to the data efficiency of transformer-based foundation models like RETFound. Image-level mixed sample data augmentations, such as CutMix and MixUp, are frequently used for training transformers, yet these techniques perturb patient-specific attributes, such as medical comorbidity and clinical factors, since they only account for images and labels. To address this limitation, we propose a hierarchical sampling strategy, Oculomix, for mixed sample augmentations. Our method is based on two clinical priors. First (exam level), images acquired from the same patient at the same time point share the same attributes. Second (patient level), images acquired from the same patient at different time points have a soft temporal trend, as morbidity generally increases over time. Guided by these priors, our method constrains the mixing space to the patient and exam levels to better preserve patient-specific characteristics and leverages their hierarchical relationships. The proposed method is validated using ViT models on a five-year prediction of major adverse cardiovascular events (MACE) in a large ethnically diverse population (Alzeye). We show that Oculomix consistently outperforms image-level CutMix and MixUp by up to 3% in AUROC, demonstrating the necessity and value of the proposed method in oculomics.
CVOct 19, 2024Code
A Multimodal Vision Foundation Model for Clinical DermatologySiyuan Yan, Zhen Yu, Clare Primiero et al.
Diagnosing and treating skin diseases require advanced visual skills across domains and the ability to synthesize information from multiple imaging modalities. While current deep learning models excel at specific tasks like skin cancer diagnosis from dermoscopic images, they struggle to meet the complex, multimodal requirements of clinical practice. Here, we introduce PanDerm, a multimodal dermatology foundation model pretrained through self-supervised learning on over 2 million real-world skin disease images from 11 clinical institutions across 4 imaging modalities. We evaluated PanDerm on 28 diverse benchmarks, including skin cancer screening, risk stratification, differential diagnosis of common and rare skin conditions, lesion segmentation, longitudinal monitoring, and metastasis prediction and prognosis. PanDerm achieved state-of-the-art performance across all evaluated tasks, often outperforming existing models when using only 10% of labeled data. We conducted three reader studies to assess PanDerm's potential clinical utility. PanDerm outperformed clinicians by 10.2% in early-stage melanoma detection through longitudinal analysis, improved clinicians' skin cancer diagnostic accuracy by 11% on dermoscopy images, and enhanced non-dermatologist healthcare providers' differential diagnosis by 16.5% across 128 skin conditions on clinical photographs. These results demonstrate PanDerm's potential to improve patient care across diverse clinical scenarios and serve as a model for developing multimodal foundation models in other medical specialties, potentially accelerating the integration of AI support in healthcare. The code can be found at https://github.com/SiyuanYan1/PanDerm.
IVJan 5, 2024Code
Prompt-driven Latent Domain Generalization for Medical Image ClassificationSiyuan Yan, Chi Liu, Zhen Yu et al.
Deep learning models for medical image analysis easily suffer from distribution shifts caused by dataset artifacts bias, camera variations, differences in the imaging station, etc., leading to unreliable diagnoses in real-world clinical settings. Domain generalization (DG) methods, which aim to train models on multiple domains to perform well on unseen domains, offer a promising direction to solve the problem. However, existing DG methods assume domain labels of each image are available and accurate, which is typically feasible for only a limited number of medical datasets. To address these challenges, we propose a novel DG framework for medical image classification without relying on domain labels, called Prompt-driven Latent Domain Generalization (PLDG). PLDG consists of unsupervised domain discovery and prompt learning. This framework first discovers pseudo domain labels by clustering the bias-associated style features, then leverages collaborative domain prompts to guide a Vision Transformer to learn knowledge from discovered diverse domains. To facilitate cross-domain knowledge learning between different prompts, we introduce a domain prompt generator that enables knowledge sharing between domain prompts and a shared prompt. A domain mixup strategy is additionally employed for more flexible decision margins and mitigates the risk of incorrect domain assignments. Extensive experiments on three medical image classification tasks and one debiasing task demonstrate that our method can achieve comparable or even superior performance than conventional DG algorithms without relying on domain labels. Our code will be publicly available upon the paper is accepted.
CVMar 20, 2024Code
Diversified and Personalized Multi-rater Medical Image SegmentationYicheng Wu, Xiangde Luo, Zhe Xu et al.
Annotation ambiguity due to inherent data uncertainties such as blurred boundaries in medical scans and different observer expertise and preferences has become a major obstacle for training deep-learning based medical image segmentation models. To address it, the common practice is to gather multiple annotations from different experts, leading to the setting of multi-rater medical image segmentation. Existing works aim to either merge different annotations into the "groundtruth" that is often unattainable in numerous medical contexts, or generate diverse results, or produce personalized results corresponding to individual expert raters. Here, we bring up a more ambitious goal for multi-rater medical image segmentation, i.e., obtaining both diversified and personalized results. Specifically, we propose a two-stage framework named D-Persona (first Diversification and then Personalization). In Stage I, we exploit multiple given annotations to train a Probabilistic U-Net model, with a bound-constrained loss to improve the prediction diversity. In this way, a common latent space is constructed in Stage I, where different latent codes denote diversified expert opinions. Then, in Stage II, we design multiple attention-based projection heads to adaptively query the corresponding expert prompts from the shared latent space, and then perform the personalized medical image segmentation. We evaluated the proposed model on our in-house Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e., LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide diversified and personalized results at the same time, achieving new SOTA performance for multi-rater medical image segmentation. Our code will be released at https://github.com/ycwu1997/D-Persona.
CVJun 10, 2024Code
Generalizing to Unseen Domains in Diabetic Retinopathy with Disentangled RepresentationsPeng Xia, Ming Hu, Feilong Tang et al.
Diabetic Retinopathy (DR), induced by diabetes, poses a significant risk of visual impairment. Accurate and effective grading of DR aids in the treatment of this condition. Yet existing models experience notable performance degradation on unseen domains due to domain shifts. Previous methods address this issue by simulating domain style through simple visual transformation and mitigating domain noise via learning robust representations. However, domain shifts encompass more than image styles. They overlook biases caused by implicit factors such as ethnicity, age, and diagnostic criteria. In our work, we propose a novel framework where representations of paired data from different domains are decoupled into semantic features and domain noise. The resulting augmented representation comprises original retinal semantics and domain noise from other domains, aiming to generate enhanced representations aligned with real-world clinical needs, incorporating rich information from diverse domains. Subsequently, to improve the robustness of the decoupled representations, class and domain prototypes are employed to interpolate the disentangled representations while data-aware weights are designed to focus on rare classes and domains. Finally, we devise a robust pixel-level semantic alignment loss to align retinal semantics decoupled from features, maintaining a balance between intra-class diversity and dense class features. Experimental results on multiple benchmarks demonstrate the effectiveness of our method on unseen domains. The code implementations are accessible on https://github.com/richard-peng-xia/DECO.
CVMar 2, 2025Code
Delving into Out-of-Distribution Detection with Medical Vision-Language ModelsLie Ju, Sijin Zhou, Yukun Zhou et al.
Recent advances in medical vision-language models (VLMs) demonstrate impressive performance in image classification tasks, driven by their strong zero-shot generalization capabilities. However, given the high variability and complexity inherent in medical imaging data, the ability of these models to detect out-of-distribution (OOD) data in this domain remains underexplored. In this work, we conduct the first systematic investigation into the OOD detection potential of medical VLMs. We evaluate state-of-the-art VLM-based OOD detection methods across a diverse set of medical VLMs, including both general and domain-specific purposes. To accurately reflect real-world challenges, we introduce a cross-modality evaluation pipeline for benchmarking full-spectrum OOD detection, rigorously assessing model robustness against both semantic shifts and covariate shifts. Furthermore, we propose a novel hierarchical prompt-based method that significantly enhances OOD detection performance. Extensive experiments are conducted to validate the effectiveness of our approach. The codes are available at https://github.com/PyJulie/Medical-VLMs-OOD-Detection.
CVMay 8, 2023Code
LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual RecognitionPeng Xia, Di Xu, Ming Hu et al.
Long-tailed multi-label visual recognition (LTML) task is a highly challenging task due to the label co-occurrence and imbalanced data distribution. In this work, we propose a unified framework for LTML, namely prompt tuning with class-specific embedding loss (LMPT), capturing the semantic feature interactions between categories by combining text and image modality data and improving the performance synchronously on both head and tail classes. Specifically, LMPT introduces the embedding loss function with class-aware soft margin and re-weighting to learn class-specific contexts with the benefit of textual descriptions (captions), which could help establish semantic relationships between classes, especially between the head and tail classes. Furthermore, taking into account the class imbalance, the distribution-balanced loss is adopted as the classification loss function to further improve the performance on the tail classes without compromising head classes. Extensive experiments are conducted on VOC-LT and COCO-LT datasets, which demonstrates that our method significantly surpasses the previous state-of-the-art methods and zero-shot CLIP in LTML. Our codes are fully public at https://github.com/richard-peng-xia/LMPT.
IVSep 3, 2025
Generalist versus Specialist Vision Foundation Models for Ocular Disease and OculomicsYukun Zhou, Paul Nderitu, Jocelyn Hui Lin Goh et al.
Medical foundation models, pre-trained with large-scale clinical data, demonstrate strong performance in diverse clinically relevant applications. RETFound, trained on nearly one million retinal images, exemplifies this approach in applications with retinal images. However, the emergence of increasingly powerful and multifold larger generalist foundation models such as DINOv2 and DINOv3 raises the question of whether domain-specific pre-training remains essential, and if so, what gap persists. To investigate this, we systematically evaluated the adaptability of DINOv2 and DINOv3 in retinal image applications, compared to two specialist RETFound models, RETFound-MAE and RETFound-DINOv2. We assessed performance on ocular disease detection and systemic disease prediction using two adaptation strategies: fine-tuning and linear probing. Data efficiency and adaptation efficiency were further analysed to characterise trade-offs between predictive performance and computational cost. Our results show that although scaling generalist models yields strong adaptability across diverse tasks, RETFound-DINOv2 consistently outperforms these generalist foundation models in ocular-disease detection and oculomics tasks, demonstrating stronger generalisability and data efficiency. These findings suggest that specialist retinal foundation models remain the most effective choice for clinical applications, while the narrowing gap with generalist foundation models suggests that continued data and model scaling can deliver domain-relevant gains and position them as strong foundations for future medical foundation models.
CVJun 22, 2024
TP-DRSeg: Improving Diabetic Retinopathy Lesion Segmentation with Explicit Text-Prompts Assisted SAMWenxue Li, Xinyu Xiong, Peng Xia et al.
Recent advances in large foundation models, such as the Segment Anything Model (SAM), have demonstrated considerable promise across various tasks. Despite their progress, these models still encounter challenges in specialized medical image analysis, especially in recognizing subtle inter-class differences in Diabetic Retinopathy (DR) lesion segmentation. In this paper, we propose a novel framework that customizes SAM for text-prompted DR lesion segmentation, termed TP-DRSeg. Our core idea involves exploiting language cues to inject medical prior knowledge into the vision-only segmentation network, thereby combining the advantages of different foundation models and enhancing the credibility of segmentation. Specifically, to unleash the potential of vision-language models in the recognition of medical concepts, we propose an explicit prior encoder that transfers implicit medical concepts into explicit prior knowledge, providing explainable clues to excavate low-level features associated with lesions. Furthermore, we design a prior-aligned injector to inject explicit priors into the segmentation process, which can facilitate knowledge sharing across multi-modality features and allow our framework to be trained in a parameter-efficient fashion. Experimental results demonstrate the superiority of our framework over other traditional models and foundation model variants.
CVNov 17, 2021
Hierarchical Knowledge Guided Learning for Real-world Retinal Diseases RecognitionLie Ju, Zhen Yu, Lin Wang et al.
In the real world, medical datasets often exhibit a long-tailed data distribution (i.e., a few classes occupy the majority of the data, while most classes have only a limited number of samples), which results in a challenging long-tailed learning scenario. Some recently published datasets in ophthalmology AI consist of more than 40 kinds of retinal diseases with complex abnormalities and variable morbidity. Nevertheless, more than 30 conditions are rarely seen in global patient cohorts. From a modeling perspective, most deep learning models trained on these datasets may lack the ability to generalize to rare diseases where only a few available samples are presented for training. In addition, there may be more than one disease for the presence of the retina, resulting in a challenging label co-occurrence scenario, also known as \textit{multi-label}, which can cause problems when some re-sampling strategies are applied during training. To address the above two major challenges, this paper presents a novel method that enables the deep neural network to learn from a long-tailed fundus database for various retinal disease recognition. Firstly, we exploit the prior knowledge in ophthalmology to improve the feature representation using a hierarchy-aware pre-training. Secondly, we adopt an instance-wise class-balanced sampling strategy to address the label co-occurrence issue under the long-tailed medical dataset scenario. Thirdly, we introduce a novel hybrid knowledge distillation to train a less biased representation and classifier. We conducted extensive experiments on four databases, including two public datasets and two in-house databases with more than one million fundus images. The experimental results demonstrate the superiority of our proposed methods with recognition accuracy outperforming the state-of-the-art competitors, especially for these rare diseases.
IVAug 4, 2021
Unsupervised Domain Adaptation for Retinal Vessel Segmentation with Adversarial Learning and Transfer NormalizationWei Feng, Lie Ju, Lin Wang et al.
Retinal vessel segmentation plays a key role in computer-aided screening, diagnosis, and treatment of various cardiovascular and ophthalmic diseases. Recently, deep learning-based retinal vessel segmentation algorithms have achieved remarkable performance. However, due to the domain shift problem, the performance of these algorithms often degrades when they are applied to new data that is different from the training data. Manually labeling new data for each test domain is often a time-consuming and laborious task. In this work, we explore unsupervised domain adaptation in retinal vessel segmentation by using entropy-based adversarial learning and transfer normalization layer to train a segmentation network, which generalizes well across domains and requires no annotation of the target domain. Specifically, first, an entropy-based adversarial learning strategy is developed to reduce the distribution discrepancy between the source and target domains while also achieving the objective of entropy minimization on the target domain. In addition, a new transfer normalization layer is proposed to further boost the transferability of the deep network. It normalizes the features of each domain separately to compensate for the domain distribution gap. Besides, it also adaptively selects those feature channels that are more transferable between domains, thus further enhancing the generalization performance of the network. We conducted extensive experiments on three regular fundus image datasets and an ultra-widefield fundus image dataset, and the results show that our approach yields significant performance gains compared to other state-of-the-art methods.
CVJun 18, 2021
Medical Matting: A New Perspective on Medical Segmentation with UncertaintyLin Wang, Lie Ju, Xin Wang et al.
It is difficult to accurately label ambiguous and complex shaped targets manually by binary masks. The weakness of binary mask under-expression is highlighted in medical image segmentation, where blurring is prevalent. In the case of multiple annotations, reaching a consensus for clinicians by binary masks is more challenging. Moreover, these uncertain areas are related to the lesions' structure and may contain anatomical information beneficial to diagnosis. However, current studies on uncertainty mainly focus on the uncertainty in model training and data labels. None of them investigate the influence of the ambiguous nature of the lesion itself.Inspired by image matting, this paper introduces alpha matte as a soft mask to represent uncertain areas in medical scenes and accordingly puts forward a new uncertainty quantification method to fill the gap of uncertainty research for lesion structure. In this work, we introduce a new architecture to generate binary masks and alpha mattes in a multitasking framework, which outperforms all state-of-the-art matting algorithms compared. The proposed uncertainty map is able to highlight the ambiguous regions and a novel multitasking loss weighting strategy we presented can improve performance further and demonstrate their concrete benefits. To fully-evaluate the effectiveness of our proposed method, we first labelled three medical datasets with alpha matte to address the shortage of available matting datasets in medical scenes and prove the alpha matte to be a more efficient labeling method than a binary mask from both qualitative and quantitative aspects.
CVApr 22, 2021
Relational Subsets Knowledge Distillation for Long-tailed Retinal Diseases RecognitionLie Ju, Xin Wang, Lin Wang et al.
In the real world, medical datasets often exhibit a long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have rarely few samples), which results in a challenging imbalance learning scenario. For example, there are estimated more than 40 different kinds of retinal diseases with variable morbidity, however with more than 30+ conditions are very rare from the global patient cohorts, which results in a typical long-tailed learning problem for deep learning-based screening models. In this study, we propose class subset learning by dividing the long-tailed data into multiple class subsets according to prior knowledge, such as regions and phenotype information. It enforces the model to focus on learning the subset-specific knowledge. More specifically, there are some relational classes that reside in the fixed retinal regions, or some common pathological features are observed in both the majority and minority conditions. With those subsets learnt teacher models, then we are able to distill the multiple teacher models into a unified model with weighted knowledge distillation loss. The proposed framework proved to be effective for the long-tailed retinal diseases recognition task. The experimental results on two different datasets demonstrate that our method is flexible and can be easily plugged into many other state-of-the-art techniques with significant improvements.
CVFeb 28, 2021
Improving Medical Image Classification with Label Noise Using Dual-uncertainty EstimationLie Ju, Xin Wang, Lin Wang et al.
Deep neural networks are known to be data-driven and label noise can have a marked impact on model performance. Recent studies have shown great robustness to classic image recognition even under a high noisy rate. In medical applications, learning from datasets with label noise is more challenging since medical imaging datasets tend to have asymmetric (class-dependent) noise and suffer from high observer variability. In this paper, we systematically discuss and define the two common types of label noise in medical images - disagreement label noise from inconsistency expert opinions and single-target label noise from wrong diagnosis record. We then propose an uncertainty estimation-based framework to handle these two label noise amid the medical image classification task. We design a dual-uncertainty estimation approach to measure the disagreement label noise and single-target label noise via Direct Uncertainty Prediction and Monte-Carlo-Dropout. A boosting-based curriculum training procedure is later introduced for robust learning. We demonstrate the effectiveness of our method by conducting extensive experiments on three different diseases: skin lesions, prostate cancer, and retinal diseases. We also release a large re-engineered database that consists of annotations from more than ten ophthalmologists with an unbiased golden standard dataset for evaluation and benchmarking.
CVNov 27, 2020
Leveraging Regular Fundus Images for Training UWF Fundus Diagnosis Models via Adversarial Learning and Pseudo-LabelingLie Ju, Xin Wang, Xin Zhao et al.
Recently, ultra-widefield (UWF) 200\degree~fundus imaging by Optos cameras has gradually been introduced because of its broader insights for detecting more information on the fundus than regular 30 degree - 60 degree fundus cameras. Compared with UWF fundus images, regular fundus images contain a large amount of high-quality and well-annotated data. Due to the domain gap, models trained by regular fundus images to recognize UWF fundus images perform poorly. Hence, given that annotating medical data is labor intensive and time consuming, in this paper, we explore how to leverage regular fundus images to improve the limited UWF fundus data and annotations for more efficient training. We propose the use of a modified cycle generative adversarial network (CycleGAN) model to bridge the gap between regular and UWF fundus and generate additional UWF fundus images for training. A consistency regularization term is proposed in the loss of the GAN to improve and regulate the quality of the generated data. Our method does not require that images from the two domains be paired or even that the semantic labels be the same, which provides great convenience for data collection. Furthermore, we show that our method is robust to noise and errors introduced by the generated unlabeled data with the pseudo-labeling technique. We evaluated the effectiveness of our methods on several common fundus diseases and tasks, such as diabetic retinopathy (DR) classification, lesion detection and tessellated fundus segmentation. The experimental results demonstrate that our proposed method simultaneously achieves superior generalizability of the learned representations and performance improvements in multiple tasks.
CVMar 24, 2020
Synergic Adversarial Label Learning for Grading Retinal Diseases via Knowledge Distillation and Multi-task LearningLie Ju, Xin Wang, Xin Zhao et al.
The need for comprehensive and automated screening methods for retinal image classification has long been recognized. Well-qualified doctors annotated images are very expensive and only a limited amount of data is available for various retinal diseases such as age-related macular degeneration (AMD) and diabetic retinopathy (DR). Some studies show that AMD and DR share some common features like hemorrhagic points and exudation but most classification algorithms only train those disease models independently. Inspired by knowledge distillation where additional monitoring signals from various sources is beneficial to train a robust model with much fewer data. We propose a method called synergic adversarial label learning (SALL) which leverages relevant retinal disease labels in both semantic and feature space as additional signals and train the model in a collaborative manner. Our experiments on DR and AMD fundus image classification task demonstrate that the proposed method can significantly improve the accuracy of the model for grading diseases. In addition, we conduct additional experiments to show the effectiveness of SALL from the aspects of reliability and interpretability in the context of medical imaging application.
IVMar 23, 2020
Bridge the Domain Gap Between Ultra-wide-field and Traditional Fundus Images via Adversarial Domain AdaptationLie Ju, Xin Wang, Quan Zhou et al.
For decades, advances in retinal imaging technology have enabled effective diagnosis and management of retinal disease using fundus cameras. Recently, ultra-wide-field (UWF) fundus imaging by Optos camera is gradually put into use because of its broader insights on fundus for some lesions that are not typically seen in traditional fundus images. Research on traditional fundus images is an active topic but studies on UWF fundus images are few. One of the most important reasons is that UWF fundus images are hard to obtain. In this paper, for the first time, we explore domain adaptation from the traditional fundus to UWF fundus images. We propose a flexible framework to bridge the domain gap between two domains and co-train a UWF fundus diagnosis model by pseudo-labelling and adversarial learning. We design a regularisation technique to regulate the domain adaptation. Also, we apply MixUp to overcome the over-fitting issue from incorrect generated pseudo-labels. Our experimental results on either single or both domains demonstrate that the proposed method can well adapt and transfer the knowledge from traditional fundus images to UWF fundus images and improve the performance of retinal disease recognition.