CVMar 26, 2022Code
RSCFed: Random Sampling Consensus Federated Semi-supervised LearningXiaoxiao Liang, Yiqun Lin, Huazhu Fu et al.
Federated semi-supervised learning (FSSL) aims to derive a global model by training fully-labeled and fully-unlabeled clients or training partially labeled clients. The existing approaches work well when local clients have independent and identically distributed (IID) data but fail to generalize to a more practical FSSL setting, i.e., Non-IID setting. In this paper, we present a Random Sampling Consensus Federated learning, namely RSCFed, by considering the uneven reliability among models from fully-labeled clients, fully-unlabeled clients or partially labeled clients. Our key motivation is that given models with large deviations from either labeled clients or unlabeled clients, the consensus could be reached by performing random sub-sampling over clients. To achieve it, instead of directly aggregating local models, we first distill several sub-consensus models by random sub-sampling over clients and then aggregating the sub-consensus models to the global model. To enhance the robustness of sub-consensus models, we also develop a novel distance-reweighted model aggregation method. Experimental results show that our method outperforms state-of-the-art methods on three benchmarked datasets, including both natural and medical images. The code is available at https://github.com/XMed-Lab/RSCFed.
IVMar 12, 2023Code
Learning Deep Intensity Field for Extremely Sparse-View CBCT ReconstructionYiqun Lin, Zhongjin Luo, Wei Zhao et al.
Sparse-view cone-beam CT (CBCT) reconstruction is an important direction to reduce radiation dose and benefit clinical applications. Previous voxel-based generation methods represent the CT as discrete voxels, resulting in high memory requirements and limited spatial resolution due to the use of 3D decoders. In this paper, we formulate the CT volume as a continuous intensity field and develop a novel DIF-Net to perform high-quality CBCT reconstruction from extremely sparse (fewer than 10) projection views at an ultrafast speed. The intensity field of a CT can be regarded as a continuous function of 3D spatial points. Therefore, the reconstruction can be reformulated as regressing the intensity value of an arbitrary 3D point from given sparse projections. Specifically, for a point, DIF-Net extracts its view-specific features from different 2D projection views. These features are subsequently aggregated by a fusion module for intensity estimation. Notably, thousands of points can be processed in parallel to improve efficiency during training and testing. In practice, we collect a knee CBCT dataset to train and evaluate DIF-Net. Extensive experiments show that our approach can reconstruct CBCT with high image quality and high spatial resolution from extremely sparse views within 1.6 seconds, significantly outperforming state-of-the-art methods. Our code will be available at https://github.com/xmed-lab/DIF-Net.
IVNov 3, 2022Code
ImageCAS: A Large-Scale Dataset and Benchmark for Coronary Artery Segmentation based on Computed Tomography Angiography ImagesAn Zeng, Chunbiao Wu, Meiping Huang et al.
Cardiovascular disease (CVD) accounts for about half of non-communicable diseases. Vessel stenosis in the coronary artery is considered to be the major risk of CVD. Computed tomography angiography (CTA) is one of the widely used noninvasive imaging modalities in coronary artery diagnosis due to its superior image resolution. Clinically, segmentation of coronary arteries is essential for the diagnosis and quantification of coronary artery disease. Recently, a variety of works have been proposed to address this problem. However, on one hand, most works rely on in-house datasets, and only a few works published their datasets to the public which only contain tens of images. On the other hand, their source code have not been published, and most follow-up works have not made comparison with existing works, which makes it difficult to judge the effectiveness of the methods and hinders the further exploration of this challenging yet critical problem in the community. In this paper, we propose a large-scale dataset for coronary artery segmentation on CTA images. In addition, we have implemented a benchmark in which we have tried our best to implement several typical existing methods. Furthermore, we propose a strong baseline method which combines multi-scale patch fusion and two-stage processing to extract the details of vessels. Comprehensive experiments show that the proposed method achieves better performance than existing works on the proposed large-scale dataset. The benchmark and the dataset are published at https://github.com/XiaoweiXu/ImageCAS-A-Large-Scale-Dataset-and-Benchmark-for-Coronary-Artery-Segmentation-based-on-CT.
CVAug 23, 2023Code
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say NoHualiang Wang, Yi Li, Huifeng Yao et al.
Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which only require class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers the logic of saying no within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples using positive-semantic prompts and negation-semantic prompts. Specifically, we design a novel learnable no prompt and a no text encoder to capture negation semantics within images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with no prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection by utilizing negation semantics from no prompts and the text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN, based on ViT-B-16, outperforms 7 well-used algorithms by at least 2.34% and 11.64% in terms of AUROC and FPR95 for zero-shot OOD detection on ImageNet-1K. Our CLIPN can serve as a solid foundation for effectively leveraging CLIP in downstream OOD tasks. The code is available on https://github.com/xmed-lab/CLIPN.
CVMay 19, 2022Code
Free Lunch for Surgical Video Understanding by Distilling Self-SupervisionsXinpeng Ding, Ziwei Liu, Xiaomeng Li
Self-supervised learning has witnessed great progress in vision and NLP; recently, it also attracted much attention to various medical imaging modalities such as X-ray, CT, and MRI. Existing methods mostly focus on building new pretext self-supervision tasks such as reconstruction, orientation, and masking identification according to the properties of medical images. However, the publicly available self-supervision models are not fully exploited. In this paper, we present a powerful yet efficient self-supervision framework for surgical video understanding. Our key insight is to distill knowledge from publicly available models trained on large generic datasets4 to facilitate the self-supervised learning of surgical videos. To this end, we first introduce a semantic-preserving training scheme to obtain our teacher model, which not only contains semantics from the publicly available models, but also can produce accurate knowledge for surgical data. Besides training with only contrastive learning, we also introduce a distillation objective to transfer the rich learned information from the teacher model to self-supervised learning on surgical data. Extensive experiments on two surgical phase recognition benchmarks show that our framework can significantly improve the performance of existing self-supervised learning methods. Notably, our framework demonstrates a compelling advantage under a low-data regime. Our code is available at https://github.com/xmed-lab/DistillingSelf.
CVJul 27, 2023Code
Federated Model Aggregation via Self-Supervised Priors for Highly Imbalanced Medical Image ClassificationMarawan Elbatel, Hualiang Wang, Robert Martí et al.
In the medical field, federated learning commonly deals with highly imbalanced datasets, including skin lesions and gastrointestinal images. Existing federated methods under highly imbalanced datasets primarily focus on optimizing a global model without incorporating the intra-class variations that can arise in medical imaging due to different populations, findings, and scanners. In this paper, we study the inter-client intra-class variations with publicly available self-supervised auxiliary networks. Specifically, we find that employing a shared auxiliary pre-trained model, like MoCo-V2, locally on every client yields consistent divergence measurements. Based on these findings, we derive a dynamic balanced model aggregation via self-supervised priors (MAS) to guide the global model optimization. Fed-MAS can be utilized with different local learning methods for effective model aggregation toward a highly robust and unbiased global model. Our code is available at \url{https://github.com/xmed-lab/Fed-MAS}.
CVApr 12, 2023Code
A Closer Look at the Explainability of Contrastive Language-Image Pre-trainingYi Li, Hualiang Wang, Yiqun Duan et al.
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
IVJul 22, 2023Code
DHC: Dual-debiased Heterogeneous Co-training Framework for Class-imbalanced Semi-supervised Medical Image SegmentationHaonan Wang, Xiaomeng Li
The volume-wise labeling of 3D medical images is expertise-demanded and time-consuming; hence semi-supervised learning (SSL) is highly desirable for training with limited labeled data. Imbalanced class distribution is a severe problem that bottlenecks the real-world application of these methods but was not addressed much. Aiming to solve this issue, we present a novel Dual-debiased Heterogeneous Co-training (DHC) framework for semi-supervised 3D medical image segmentation. Specifically, we propose two loss weighting strategies, namely Distribution-aware Debiased Weighting (DistDW) and Difficulty-aware Debiased Weighting (DiffDW), which leverage the pseudo labels dynamically to guide the model to solve data and learning biases. The framework improves significantly by co-training these two diverse and accurate sub-models. We also introduce more representative benchmarks for class-imbalanced semi-supervised medical image segmentation, which can fully demonstrate the efficacy of the class-imbalance designs. Experiments show that our proposed framework brings significant improvements by using pseudo labels for debiasing and alleviating the class imbalance problem. More importantly, our method outperforms the state-of-the-art SSL methods, demonstrating the potential of our framework for the more challenging SSL setting. Code and models are available at: https://github.com/xmed-lab/DHC.
CVApr 15, 2023Code
Compete to Win: Enhancing Pseudo Labels for Barely-supervised Medical Image SegmentationHuimin Wu, Xiaomeng Li, Yiqun Lin et al.
This study investigates barely-supervised medical image segmentation where only few labeled data, i.e., single-digit cases are available. We observe the key limitation of the existing state-of-the-art semi-supervised solution cross pseudo supervision is the unsatisfactory precision of foreground classes, leading to a degenerated result under barely-supervised learning. In this paper, we propose a novel Compete-to-Win method (ComWin) to enhance the pseudo label quality. In contrast to directly using one model's predictions as pseudo labels, our key idea is that high-quality pseudo labels should be generated by comparing multiple confidence maps produced by different networks to select the most confident one (a compete-to-win strategy). To further refine pseudo labels at near-boundary areas, an enhanced version of ComWin, namely, ComWin+, is proposed by integrating a boundary-aware enhancement module. Experiments show that our method can achieve the best performance on three public medical image datasets for cardiac structure segmentation, pancreas segmentation and colon tumor segmentation, respectively. The source code is now available at https://github.com/Huiimin5/comwin.
CVOct 20, 2022Code
Cyclical Self-Supervision for Semi-Supervised Ejection Fraction Prediction from Echocardiogram VideosWeihang Dai, Xiaomeng Li, Xinpeng Ding et al.
Left-ventricular ejection fraction (LVEF) is an important indicator of heart failure. Existing methods for LVEF estimation from video require large amounts of annotated data to achieve high performance, e.g. using 10,030 labeled echocardiogram videos to achieve mean absolute error (MAE) of 4.10. Labeling these videos is time-consuming however and limits potential downstream applications to other heart diseases. This paper presents the first semi-supervised approach for LVEF prediction. Unlike general video prediction tasks, LVEF prediction is specifically related to changes in the left ventricle (LV) in echocardiogram videos. By incorporating knowledge learned from predicting LV segmentations into LVEF regression, we can provide additional context to the model for better predictions. To this end, we propose a novel Cyclical Self-Supervision (CSS) method for learning video-based LV segmentation, which is motivated by the observation that the heartbeat is a cyclical process with temporal repetition. Prediction masks from our segmentation model can then be used as additional input for LVEF regression to provide spatial context for the LV region. We also introduce teacher-student distillation to distill the information from LV segmentation masks into an end-to-end LVEF regression model that only requires video inputs. Results show our method outperforms alternative semi-supervised methods and can achieve MAE of 4.17, which is competitive with state-of-the-art supervised performance, using half the number of labels. Validation on an external dataset also shows improved generalization ability from using our method. Our code is available at https://github.com/xmed-lab/CSS-SemiVideo.
CVFeb 15, 2023Code
Semi-Supervised Deep Regression with Uncertainty Consistency and Variational Model Ensembling via Bayesian Neural NetworksWeihang Dai, Xiaomeng Li, Kwang-Ting Cheng
Deep regression is an important problem with numerous applications. These range from computer vision tasks such as age estimation from photographs, to medical tasks such as ejection fraction estimation from echocardiograms for disease tracking. Semi-supervised approaches for deep regression are notably under-explored compared to classification and segmentation tasks, however. Unlike classification tasks, which rely on thresholding functions for generating class pseudo-labels, regression tasks use real number target predictions directly as pseudo-labels, making them more sensitive to prediction quality. In this work, we propose a novel approach to semi-supervised regression, namely Uncertainty-Consistent Variational Model Ensembling (UCVME), which improves training by generating high-quality pseudo-labels and uncertainty estimates for heteroscedastic regression. Given that aleatoric uncertainty is only dependent on input data by definition and should be equal for the same inputs, we present a novel uncertainty consistency loss for co-trained models. Our consistency loss significantly improves uncertainty estimates and allows higher quality pseudo-labels to be assigned greater importance under heteroscedastic regression. Furthermore, we introduce a novel variational model ensembling approach to reduce prediction noise and generate more robust pseudo-labels. We analytically show our method generates higher quality targets for unlabeled data and further improves training. Experiments show that our method outperforms state-of-the-art alternatives on different tasks and can be competitive with supervised methods that use full labels. Our code is available at https://github.com/xmed-lab/UCVME.
CVApr 18, 2022Code
CgAT: Center-Guided Adversarial Training for Deep Hashing-Based RetrievalXunguang Wang, Yiqun Lin, Xiaomeng Li
Deep hashing has been extensively utilized in massive image retrieval because of its efficiency and effectiveness. However, deep hashing models are vulnerable to adversarial examples, making it essential to develop adversarial defense methods for image retrieval. Existing solutions achieved limited defense performance because of using weak adversarial samples for training and lacking discriminative optimization objectives to learn robust features. In this paper, we present a min-max based Center-guided Adversarial Training, namely CgAT, to improve the robustness of deep hashing networks through worst adversarial examples. Specifically, we first formulate the center code as a semantically-discriminative representative of the input image content, which preserves the semantic similarity with positive samples and dissimilarity with negative examples. We prove that a mathematical formula can calculate the center code immediately. After obtaining the center codes in each optimization iteration of the deep hashing network, they are adopted to guide the adversarial training process. On the one hand, CgAT generates the worst adversarial examples as augmented data by maximizing the Hamming distance between the hash codes of the adversarial examples and the center codes. On the other hand, CgAT learns to mitigate the effects of adversarial samples by minimizing the Hamming distance to the center codes. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our adversarial training algorithm in defending against adversarial attacks for deep hashing-based retrieval. Compared with the current state-of-the-art defense method, we significantly improve the defense performance by an average of 18.61\%, 12.35\%, and 11.56\% on FLICKR-25K, NUS-WIDE, and MS-COCO, respectively. The code is available at https://github.com/xunguangwang/CgAT.
CVJun 14, 2022Code
Online Easy Example Mining for Weakly-supervised Gland Segmentation from Histology ImagesYi Li, Yiduo Yu, Yiwen Zou et al.
Developing an AI-assisted gland segmentation method from histology images is critical for automatic cancer diagnosis and prognosis; however, the high cost of pixel-level annotations hinders its applications to broader diseases. Existing weakly-supervised semantic segmentation methods in computer vision achieve degenerative results for gland segmentation, since the characteristics and problems of glandular datasets are different from general object datasets. We observe that, unlike natural images, the key problem with histology images is the confusion of classes owning to morphological homogeneity and low color contrast among different tissues. To this end, we propose a novel method Online Easy Example Mining (OEEM) that encourages the network to focus on credible supervision signals rather than noisy signals, therefore mitigating the influence of inevitable false predictions in pseudo-masks. According to the characteristics of glandular datasets, we design a strong framework for gland segmentation. Our results exceed many fully-supervised methods and weakly-supervised methods for gland segmentation over 4.4% and 6.04% at mIoU, respectively. Code is available at https://github.com/xmed-lab/OEEM.
IVAug 1, 2023Code
Fundus-Enhanced Disease-Aware Distillation Model for Retinal Disease Classification from OCT ImagesLehan Wang, Weihang Dai, Mei Jin et al.
Optical Coherence Tomography (OCT) is a novel and effective screening tool for ophthalmic examination. Since collecting OCT images is relatively more expensive than fundus photographs, existing methods use multi-modal learning to complement limited OCT data with additional context from fundus images. However, the multi-modal framework requires eye-paired datasets of both modalities, which is impractical for clinical use. To address this problem, we propose a novel fundus-enhanced disease-aware distillation model (FDDM), for retinal disease classification from OCT images. Our framework enhances the OCT model during training by utilizing unpaired fundus images and does not require the use of fundus images during testing, which greatly improves the practicality and efficiency of our method for clinical use. Specifically, we propose a novel class prototype matching to distill disease-related information from the fundus model to the OCT model and a novel class similarity alignment to enforce consistency between disease distribution of both modalities. Experimental results show that our proposed approach outperforms single-modal, multi-modal, and state-of-the-art distillation methods for retinal disease classification. Code is available at https://github.com/xmed-lab/FDDM.
CVSep 20, 2023Code
GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video SegmentationJiewen Yang, Xinpeng Ding, Ziyang Zheng et al.
Echocardiogram video segmentation plays an important role in cardiac disease diagnosis. This paper studies the unsupervised domain adaption (UDA) for echocardiogram video segmentation, where the goal is to generalize the model trained on the source domain to other unlabelled target domains. Existing UDA segmentation methods are not suitable for this task because they do not model local information and the cyclical consistency of heartbeat. In this paper, we introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for cardiac structure segmentation. Our GraphEcho comprises two innovative modules, the Spatial-wise Cross-domain Graph Matching (SCGM) and the Temporal Cycle Consistency (TCC) module, which utilize prior knowledge of echocardiogram videos, i.e., consistent cardiac structure across patients and centers and the heartbeat cyclical consistency, respectively. These two modules can better align global and local features from source and target domains, improving UDA segmentation results. Experimental results showed that our GraphEcho outperforms existing state-of-the-art UDA segmentation methods. Our collected dataset and code will be publicly released upon acceptance. This work will lay a new and solid cornerstone for cardiac structure segmentation from echocardiogram videos. Code and dataset are available at: https://github.com/xmed-lab/GraphEcho
CVAug 15, 2023Code
Context-Aware Pseudo-Label Refinement for Source-Free Domain Adaptive Fundus Image SegmentationZheang Huai, Xinpeng Ding, Yi Li et al.
In the domain adaptation problem, source data may be unavailable to the target client side due to privacy or intellectual property issues. Source-free unsupervised domain adaptation (SF-UDA) aims at adapting a model trained on the source side to align the target distribution with only the source model and unlabeled target data. The source model usually produces noisy and context-inconsistent pseudo-labels on the target domain, i.e., neighbouring regions that have a similar visual appearance are annotated with different pseudo-labels. This observation motivates us to refine pseudo-labels with context relations. Another observation is that features of the same class tend to form a cluster despite the domain gap, which implies context relations can be readily calculated from feature distances. To this end, we propose a context-aware pseudo-label refinement method for SF-UDA. Specifically, a context-similarity learning module is developed to learn context relations. Next, pseudo-label revision is designed utilizing the learned context relations. Further, we propose calibrating the revised pseudo-labels to compensate for wrong revision caused by inaccurate context relations. Additionally, we adopt a pixel-level and class-level denoising scheme to select reliable pseudo-labels for domain adaptation. Experiments on cross-domain fundus images indicate that our approach yields the state-of-the-art results. Code is available at https://github.com/xmed-lab/CPR.
CVAug 14, 2023Code
Radiomics-Informed Deep Learning for Classification of Atrial Fibrillation Sub-Types from Left-Atrium CT VolumesWeihang Dai, Xiaomeng Li, Taihui Yu et al.
Atrial Fibrillation (AF) is characterized by rapid, irregular heartbeats, and can lead to fatal complications such as heart failure. The disease is divided into two sub-types based on severity, which can be automatically classified through CT volumes for disease screening of severe cases. However, existing classification approaches rely on generic radiomic features that may not be optimal for the task, whilst deep learning methods tend to over-fit to the high-dimensional volume inputs. In this work, we propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification. Unlike existing hybrid techniques that mostly rely on naïve feature concatenation, we observe that radiomic feature selection methods can serve as an information prior, and propose supplementing low-level deep neural network (DNN) features with locally computed radiomic features. This reduces DNN over-fitting and allows local variations between radiomic features to be better captured. Furthermore, we ensure complementary information is learned by deep and radiomic features by designing a novel feature de-correlation loss. Combined, our method addresses the limitations of deep learning and radiomic approaches and outperforms state-of-the-art radiomic, deep learning, and hybrid approaches, achieving 86.9% AUC for the AF sub-type classification task. Code is available at https://github.com/xmed-lab/RIDL.
IVSep 30, 2022Code
Semi-Supervised Domain Generalization for Cardiac Magnetic Resonance Image Segmentation with High Quality Pseudo LabelsWanqin Ma, Huifeng Yao, Yiqun Lin et al.
Developing a deep learning method for medical segmentation tasks heavily relies on a large amount of labeled data. However, the annotations require professional knowledge and are limited in number. Recently, semi-supervised learning has demonstrated great potential in medical segmentation tasks. Most existing methods related to cardiac magnetic resonance images only focus on regular images with similar domains and high image quality. A semi-supervised domain generalization method was developed in [2], which enhances the quality of pseudo labels on varied datasets. In this paper, we follow the strategy in [2] and present a domain generalization method for semi-supervised medical segmentation. Our main goal is to improve the quality of pseudo labels under extreme MRI Analysis with various domains. We perform Fourier transformation on input images to learn low-level statistics and cross-domain information. Then we feed the augmented images as input to the double cross pseudo supervision networks to calculate the variance among pseudo labels. We evaluate our method on the CMRxMotion dataset [1]. With only partially labeled data and without domain labels, our approach consistently generates accurate segmentation results of cardiac magnetic resonance images with different respiratory motions. Code is available at: https://github.com/MAWanqin2002/STACOM2022Ma
CVSep 20, 2023Code
GL-Fusion: Global-Local Fusion Network for Multi-view Echocardiogram Video SegmentationZiyang Zheng, Jiewen Yang, Xinpeng Ding et al.
Cardiac structure segmentation from echocardiogram videos plays a crucial role in diagnosing heart disease. The combination of multi-view echocardiogram data is essential to enhance the accuracy and robustness of automated methods. However, due to the visual disparity of the data, deriving cross-view context information remains a challenging task, and unsophisticated fusion strategies can even lower performance. In this study, we propose a novel Gobal-Local fusion (GL-Fusion) network to jointly utilize multi-view information globally and locally that improve the accuracy of echocardiogram analysis. Specifically, a Multi-view Global-based Fusion Module (MGFM) is proposed to extract global context information and to explore the cyclic relationship of different heartbeat cycles in an echocardiogram video. Additionally, a Multi-view Local-based Fusion Module (MLFM) is designed to extract correlations of cardiac structures from different views. Furthermore, we collect a multi-view echocardiogram video dataset (MvEVD) to evaluate our method. Our method achieves an 82.29% average dice score, which demonstrates a 7.83% improvement over the baseline method, and outperforms other existing state-of-the-art methods. To our knowledge, this is the first exploration of a multi-view method for echocardiogram video segmentation. Code available at: https://github.com/xmed-lab/GL-Fusion
CVJul 22, 2023Code
FSDiffReg: Feature-wise and Score-wise Diffusion-guided Unsupervised Deformable Image Registration for Cardiac ImagesYi Qin, Xiaomeng Li
Unsupervised deformable image registration is one of the challenging tasks in medical imaging. Obtaining a high-quality deformation field while preserving deformation topology remains demanding amid a series of deep-learning-based solutions. Meanwhile, the diffusion model's latent feature space shows potential in modeling the deformation semantics. To fully exploit the diffusion model's ability to guide the registration task, we present two modules: Feature-wise Diffusion-Guided Module (FDG) and Score-wise Diffusion-Guided Module (SDG). Specifically, FDG uses the diffusion model's multi-scale semantic features to guide the generation of the deformation field. SDG uses the diffusion score to guide the optimization process for preserving deformation topology with barely any additional computation. Experiment results on the 3D medical cardiac image registration task validate our model's ability to provide refined deformation fields with preserved topology effectively. Code is available at: https://github.com/xmed-lab/FSDiffReg.git.
72.9CVMar 29
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model DevelopmentZhongying Deng, Cheng Tang, Ziyan Huang et al. · pku
Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.
IVApr 13, 2022
WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung AdenocarcinomaChu Han, Xipeng Pan, Lixu Yan et al.
Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient pixel-level annotations, which is time-consuming and expensive. To enrich the label resources of LUAD and to alleviate the annotation efforts, we organize this challenge WSSS4LUAD to call for the outstanding weakly-supervised semantic segmentation (WSSS) techniques for histopathology images of LUAD. Participants have to design the algorithm to segment tumor epithelial, tumor-associated stroma and normal tissue with only patch-level labels. This challenge includes 10,091 patch-level annotations (the training set) and over 130 million labeled pixels (the validation and test sets), from 87 WSIs (67 from GDPH, 20 from TCGA). All the labels were generated by a pathologist-in-the-loop pipeline with the help of AI models and checked by the label review board. Among 532 registrations, 28 teams submitted the results in the test phase with over 1,000 submissions. Finally, the first place team achieved mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919). According to the technical reports of the top-tier teams, CAM is still the most popular approach in WSSS. Cutmix data augmentation has been widely adopted to generate more reliable samples. With the success of this challenge, we believe that WSSS approaches with patch-level annotations can be a complement to the traditional pixel annotations while reducing the annotation efforts. The entire dataset has been released to encourage more researches on computational pathology in LUAD and more novel WSSS techniques.
IVOct 17, 2023Code
Towards Generic Semi-Supervised Framework for Volumetric Medical Image SegmentationHaonan Wang, Xiaomeng Li
Volume-wise labeling in 3D medical images is a time-consuming task that requires expertise. As a result, there is growing interest in using semi-supervised learning (SSL) techniques to train models with limited labeled data. However, the challenges and practical applications extend beyond SSL to settings such as unsupervised domain adaptation (UDA) and semi-supervised domain generalization (SemiDG). This work aims to develop a generic SSL framework that can handle all three settings. We identify two main obstacles to achieving this goal in the existing SSL framework: 1) the weakness of capturing distribution-invariant features; and 2) the tendency for unlabeled data to be overwhelmed by labeled data, leading to over-fitting to the labeled data during training. To address these issues, we propose an Aggregating & Decoupling framework. The aggregating part consists of a Diffusion encoder that constructs a common knowledge set by extracting distribution-invariant features from aggregated information from multiple distributions/domains. The decoupling part consists of three decoders that decouple the training process with labeled and unlabeled data, thus avoiding over-fitting to labeled data, specific domains and classes. We evaluate our proposed framework on four benchmark datasets for SSL, Class-imbalanced SSL, UDA and SemiDG. The results showcase notable improvements compared to state-of-the-art methods across all four settings, indicating the potential of our framework to tackle more challenging SSL scenarios. Code and models are available at: https://github.com/xmed-lab/GenericSSL.
CVJul 30, 2022
Learning Shadow Correspondence for Video Shadow DetectionXinpeng Ding, Jingweng Yang, Xiaowei Hu et al.
Video shadow detection aims to generate consistent shadow predictions among video frames. However, the current approaches suffer from inconsistent shadow predictions across frames, especially when the illumination and background textures change in a video. We make an observation that the inconsistent predictions are caused by the shadow feature inconsistency, i.e., the features of the same shadow regions show dissimilar proprieties among the nearby frames.In this paper, we present a novel Shadow-Consistent Correspondence method (SC-Cor) to enhance pixel-wise similarity of the specific shadow regions across frames for video shadow detection. Our proposed SC-Cor has three main advantages. Firstly, without requiring the dense pixel-to-pixel correspondence labels, SC-Cor can learn the pixel-wise correspondence across frames in a weakly-supervised manner. Secondly, SC-Cor considers intra-shadow separability, which is robust to the variant textures and illuminations in videos. Finally, SC-Cor is a plug-and-play module that can be easily integrated into existing shadow detectors with no extra computational cost. We further design a new evaluation metric to evaluate the temporal stability of the video shadow detection results. Experimental results show that SC-Cor outperforms the prior state-of-the-art method, by 6.51% on IoU and 3.35% on the newly introduced temporal stability metric.
CVMay 7, 2022
Calibrating Label Distribution for Class-Imbalanced Barely-Supervised Knee SegmentationYiqun Lin, Huifeng Yao, Zezhong Li et al.
Segmentation of 3D knee MR images is important for the assessment of osteoarthritis. Like other medical data, the volume-wise labeling of knee MR images is expertise-demanded and time-consuming; hence semi-supervised learning (SSL), particularly barely-supervised learning, is highly desirable for training with insufficient labeled data. We observed that the class imbalance problem is severe in the knee MR images as the cartilages only occupy 6% of foreground volumes, and the situation becomes worse without sufficient labeled data. To address the above problem, we present a novel framework for barely-supervised knee segmentation with noisy and imbalanced labels. Our framework leverages label distribution to encourage the network to put more effort into learning cartilage parts. Specifically, we utilize 1.) label quantity distribution for modifying the objective loss function to a class-aware weighted form and 2.) label position distribution for constructing a cropping probability mask to crop more sub-volumes in cartilage areas from both labeled and unlabeled inputs. In addition, we design dual uncertainty-aware sampling supervision to enhance the supervision of low-confident categories for efficient unsupervised learning. Experiments show that our proposed framework brings significant improvements by incorporating the unlabeled data and alleviating the problem of class imbalance. More importantly, our method outperforms the state-of-the-art SSL methods, demonstrating the potential of our framework for the more challenging SSL setting.
31.7CVMay 23Code
ULF-Synth: Physics-Guided Ultra-Low-Field MRI Enhancement for Pediatric NeuroimagingToufiq Musah, Salvatore Calcagno, Federica Proietto Salanitri et al.
Ultra-low-field (ULF) MRI offers portable and accessible neuroimaging but suffers from reduced signal-to-noise ratio and limited spatial resolution compared to high-field (HF) systems. Acquiring paired ULF-HF data for supervised enhancement is often difficult, particularly in resource-limited settings. We introduce ULF-Synth, a framework that combines: (i) acquisition-based synthesis of realistic ULF images from HF volumes to create large-scale paired training data, (ii) a spatial-frequency domain objective that prioritizes recovery of high-frequency anatomical detail. This formulation is architecture-agnostic, consistently improving structural similarity and perceptual fidelity across encoder-decoder, adversarial, and diffusion-based translation models. When trained exclusively on synthetic data, the resulting models generalize effectively to real 64mT ULF acquisitions, improving downstream multiclass brain segmentation and achieving higher radiologist preference and diagnostic acceptability in a blinded reader study. These findings demonstrate that synthetic paired supervision provides a practical and scalable pathway for enhancing ULF MRI without requiring real paired acquisitions. Code, Models and Dataset: https://github.com/toufiqmusah/ULF-Synth
CVAug 7, 2024Code
Teach CLIP to Develop a Number Sense for Ordinal RegressionYao Du, Qiang Zhai, Weihang Dai et al.
Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We disassemble the exact image to number-specific text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concept to better leverage the available pre-trained alignment in CLIP. To consider the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on historical image dating and image aesthetics assessment task, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.
CVSep 11, 2023
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous DrivingXinpeng Ding, Jianhua Han, Hang Xu et al.
Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which towards interpretable risk object detection and suggestion for ego car motions. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the limited perception performance of CLIP-ViT vision encoders in existing MLLMs struggles with capturing essential visual perception information, e.g., high-resolution, multi-scale and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream can be made very lightweight by a well-designed P-Adapter, which is lightweight, training-efficient, and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D's significant improvements over current MLLMs, with a 3.7% in BLEU-4 for captioning and 8.7% in mIoU for detection.
CVNov 6, 2025Code
MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark DetectionMarawan Elbatel, Anbang Wang, Keyuan Liu et al.
This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens .
CVJul 3, 2022
Dynamic Sub-Cluster-Aware Network for Few-Shot Skin Disease ClassificationShuhan LI, Xiaomeng Li, Xiaowei Xu et al.
This paper addresses the problem of few-shot skin disease classification by introducing a novel approach called the Sub-Cluster-Aware Network (SCAN) that enhances accuracy in diagnosing rare skin diseases. The key insight motivating the design of SCAN is the observation that skin disease images within a class often exhibit multiple sub-clusters, characterized by distinct variations in appearance. To improve the performance of few-shot learning, we focus on learning a high-quality feature encoder that captures the unique sub-clustered representations within each disease class, enabling better characterization of feature distributions. Specifically, SCAN follows a dual-branch framework, where the first branch learns class-wise features to distinguish different skin diseases, and the second branch aims to learn features which can effectively partition each class into several groups so as to preserve the sub-clustered structure within each class. To achieve the objective of the second branch, we present a cluster loss to learn image similarities via unsupervised clustering. To ensure that the samples in each sub-cluster are from the same class, we further design a purity loss to refine the unsupervised clustering results. We evaluate the proposed approach on two public datasets for few-shot skin disease classification. The experimental results validate that our framework outperforms the state-of-the-art methods by around 2% to 5% in terms of sensitivity, specificity, accuracy, and F1-score on the SD-198 and Derm7pt datasets.
LGJul 14, 2024Code
Learning Unlabeled Clients Divergence for Federated Semi-Supervised Learning via Anchor Model AggregationMarawan Elbatel, Hualiang Wang, Jixiang Chen et al.
Federated semi-supervised learning (FedSemi) refers to scenarios where there may be clients with fully labeled data, clients with partially labeled, and even fully unlabeled clients while preserving data privacy. However, challenges arise from client drift due to undefined heterogeneous class distributions and erroneous pseudo-labels. Existing FedSemi methods typically fail to aggregate models from unlabeled clients due to their inherent unreliability, thus overlooking unique information from their heterogeneous data distribution, leading to sub-optimal results. In this paper, we enable unlabeled client aggregation through SemiAnAgg, a novel Semi-supervised Anchor-Based federated Aggregation. SemiAnAgg learns unlabeled client contributions via an anchor model, effectively harnessing their informative value. Our key idea is that by feeding local client data to the same global model and the same consistently initialized anchor model (i.e., random model), we can measure the importance of each unlabeled client accordingly. Extensive experiments demonstrate that SemiAnAgg achieves new state-of-the-art results on four widely used FedSemi benchmarks, leading to substantial performance improvements: a 9% increase in accuracy on CIFAR-100 and a 7.6% improvement in recall on the medical dataset ISIC-18, compared with prior state-of-the-art. Code is available at: https://github.com/xmed-lab/SemiAnAgg.
CVJul 3, 2024Code
An Organism Starts with a Single Pix-Cell: A Neural Cellular Diffusion for High-Resolution Image SynthesisMarawan Elbatel, Konstantinos Kamnitsas, Xiaomeng Li
Generative modeling seeks to approximate the statistical properties of real data, enabling synthesis of new data that closely resembles the original distribution. Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs) represent significant advancements in generative modeling, drawing inspiration from game theory and thermodynamics, respectively. Nevertheless, the exploration of generative modeling through the lens of biological evolution remains largely untapped. In this paper, we introduce a novel family of models termed Generative Cellular Automata (GeCA), inspired by the evolution of an organism from a single cell. GeCAs are evaluated as an effective augmentation tool for retinal disease classification across two imaging modalities: Fundus and Optical Coherence Tomography (OCT). In the context of OCT imaging, where data is scarce and the distribution of classes is inherently skewed, GeCA significantly boosts the performance of 11 different ophthalmological conditions, achieving a 12% increase in the average F1 score compared to conventional baselines. GeCAs outperform both diffusion methods that incorporate UNet or state-of-the art variants with transformer-based denoising models, under similar parameter constraints. Code is available at: https://github.com/xmed-lab/GeCA.
37.7CVMay 15Code
TriALS: Triphasic-Aided Liver Lesion Segmentation Benchmark in Non-Contrast CTMarawan Elbatel, Mohamed Ghonim, Jiaji Mao et al.
Automated segmentation of liver lesions on non-contrast computed tomography (NCCT) is clinically important but fundamentally challenging, particularly in low-resource settings across Africa and Asia where contrast agents are frequently unavailable. Progress has been limited by the absence of annotated NCCT benchmarks. Here we describe the TriALS challenge for automated liver lesion segmentation under contrast-limited conditions, supported by a multi-centre dataset of 150 cases with four-phase CT acquisitions (600 volumes) from Egyptian and Chinese institutions. Algorithms were evaluated on 70 cases from three institutions, including an independent external cohort. The top-performing method achieved a mean venous-phase Dice of 0.754, consistent with human-level performance, yet dropped to 0.57 on NCCT. On external validation, the leading method outperformed off-the-shelf models by up to 28% in Dice on NCCT. Algorithm performance was most strongly predicted by training data scale and pre-training strategy. A cross-year comparison exposed a persistent perceptual barrier on NCCT that scaling pre-training alone cannot overcome. Data, annotations, and code are available at https://github.com/xmed-lab/TriALS.
83.9CLApr 3Code
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text DecodingYuchen Wang, Haonan Wang, Yu Guo et al.
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at https://github.com/xmed-lab/SemKey.
IVJul 12, 2024Code
FD-SOS: Vision-Language Open-Set Detectors for Bone Fenestration and Dehiscence Detection from Intraoral ImagesMarawan Elbatel, Keyuan Liu, Yanqi Yang et al.
Accurate detection of bone fenestration and dehiscence (FD) is crucial for effective treatment planning in dentistry. While cone-beam computed tomography (CBCT) is the gold standard for evaluating FD, it comes with limitations such as radiation exposure, limited accessibility, and higher cost compared to intraoral images. In intraoral images, dentists face challenges in the differential diagnosis of FD. This paper presents a novel and clinically significant application of FD detection solely from intraoral images. To achieve this, we propose FD-SOS, a novel open-set object detector for FD detection from intraoral images. FD-SOS has two novel components: conditional contrastive denoising (CCDN) and teeth-specific matching assignment (TMA). These modules enable FD-SOS to effectively leverage external dental semantics. Experimental results showed that our method outperformed existing detection methods and surpassed dental professionals by 35% recall under the same level of precision. Code is available at: https://github.com/xmed-lab/FD-SOS.
IVSep 17, 2024
PSFHS Challenge Report: Pubic Symphysis and Fetal Head Segmentation from Intrapartum Ultrasound ImagesJieyun Bai, Zihao Zhou, Zhanhong Ou et al.
Segmentation of the fetal and maternal structures, particularly intrapartum ultrasound imaging as advocated by the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) for monitoring labor progression, is a crucial first step for quantitative diagnosis and clinical decision-making. This requires specialized analysis by obstetrics professionals, in a task that i) is highly time- and cost-consuming and ii) often yields inconsistent results. The utility of automatic segmentation algorithms for biometry has been proven, though existing results remain suboptimal. To push forward advancements in this area, the Grand Challenge on Pubic Symphysis-Fetal Head Segmentation (PSFHS) was held alongside the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023). This challenge aimed to enhance the development of automatic segmentation algorithms at an international scale, providing the largest dataset to date with 5,101 intrapartum ultrasound images collected from two ultrasound machines across three hospitals from two institutions. The scientific community's enthusiastic participation led to the selection of the top 8 out of 179 entries from 193 registrants in the initial phase to proceed to the competition's second stage. These algorithms have elevated the state-of-the-art in automatic PSFHS from intrapartum ultrasound images. A thorough analysis of the results pinpointed ongoing challenges in the field and outlined recommendations for future work. The top solutions and the complete dataset remain publicly available, fostering further advancements in automatic segmentation and biometry for intrapartum ultrasound imaging.
LGDec 10, 2025Code
Contrastive Learning for Semi-Supervised Deep Regression with Generalized Ordinal Rankings from Spectral SeriationCe Wang, Weihang Dai, Hanru Bai et al.
Contrastive learning methods enforce label distance relationships in feature space to improve representation capability for regression models. However, these methods highly depend on label information to correctly recover ordinal relationships of features, limiting their applications to semi-supervised regression. In this work, we extend contrastive regression methods to allow unlabeled data to be used in the semi-supervised setting, thereby reducing the dependence on costly annotations. Particularly we construct the feature similarity matrix with both labeled and unlabeled samples in a mini-batch to reflect inter-sample relationships, and an accurate ordinal ranking of involved unlabeled samples can be recovered through spectral seriation algorithms if the level of error is within certain bounds. The introduction of labeled samples above provides regularization of the ordinal ranking with guidance from the ground-truth label information, making the ranking more reliable. To reduce feature perturbations, we further utilize the dynamic programming algorithm to select robust features for the matrix construction. The recovered ordinal relationship is then used for contrastive learning on unlabeled samples, and we thus allow more data to be used for feature representation learning, thereby achieving more robust results. The ordinal rankings can also be used to supervise predictions on unlabeled samples, serving as an additional training signal. We provide theoretical guarantees and empirical verification through experiments on various datasets, demonstrating that our method can surpass existing state-of-the-art semi-supervised deep regression methods. Our code have been released on https://github.com/xmed-lab/CLSS.
CVSep 15, 2022
Exploring Visual Interpretability for Contrastive Language-Image Pre-trainingYi Li, Hualiang Wang, Yiqun Duan et al.
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available supervision of natural language. It improves the performance of downstream vision tasks, including but not limited to the zero-shot, long tail, segmentation, retrieval, caption, and video. However, the visual explainability of CLIP is rarely studied, especially for the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers the background regions than the foregrounds, and shows erroneous visualization results against human understanding. This phenomenon is universal for both vision transformers and convolutional networks, which suggests this problem is unique and not owing to certain network. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon called semantic shift. For this problem, we propose the Explainable Contrastive Language-Image Pre-training (ECLIP), which corrects the explainability via the Masked Max Pooling. Specifically, to avoid the semantic shift, we replace the original attention pooling by max pooling to focus on the confident foreground, with guidance from free attention during training. Experiments on three datasets suggest that ECLIP greatly improves the explainability of CLIP, and beyond previous explainability methods at large margins. The code will be released later.
CVSep 27, 2022
FreeSeg: Free Mask from Interpretable Contrastive Language-Image Pretraining for Semantic SegmentationYi Li, Huifeng Yao, Hualiang Wang et al.
Fully supervised semantic segmentation learns from dense masks, which requires heavy annotation cost for closed set. In this paper, we use natural language as supervision without any pixel-level annotation for open world segmentation. We call the proposed framework as FreeSeg, where the mask is freely available from raw feature map of pretraining model. Compared with zero-shot or openset segmentation, FreeSeg doesn't require any annotated masks, and it widely predicts categories beyond class-agnostic unsupervised segmentation. Specifically, FreeSeg obtains free mask from Image-Text Similarity Map (ITSM) of Interpretable Contrastive Language-Image Pretraining (ICLIP). And our core improvements are the smoothed min pooling for dense ICLIP, with the partial label and pixel strategies for segmentation. Furthermore, FreeSeg is very straight forward without complex design like grouping, clustering or retrieval. Besides the simplicity, the performances of FreeSeg surpass previous state-of-the-art at large margins, e.g. 13.4% higher at mIoU on VOC dataset in the same settings.
CVJul 24, 2024Code
CoMoTo: Unpaired Cross-Modal Lesion Distillation Improves Breast Lesion Detection in TomosynthesisMuhammad Alberb, Marawan Elbatel, Aya Elgebaly et al.
Digital Breast Tomosynthesis (DBT) is an advanced breast imaging modality that offers superior lesion detection accuracy compared to conventional mammography, albeit at the trade-off of longer reading time. Accelerating lesion detection from DBT using deep learning is hindered by limited data availability and huge annotation costs. A possible solution to this issue could be to leverage the information provided by a more widely available modality, such as mammography, to enhance DBT lesion detection. In this paper, we present a novel framework, CoMoTo, for improving lesion detection in DBT. Our framework leverages unpaired mammography data to enhance the training of a DBT model, improving practicality by eliminating the need for mammography during inference. Specifically, we propose two novel components, Lesion-specific Knowledge Distillation (LsKD) and Intra-modal Point Alignment (ImPA). LsKD selectively distills lesion features from a mammography teacher model to a DBT student model, disregarding background features. ImPA further enriches LsKD by ensuring the alignment of lesion features within the teacher before distilling knowledge to the student. Our comprehensive evaluation shows that CoMoTo is superior to traditional pretraining and image-level KD, improving performance by 7% Mean Sensitivity under low-data setting. Our code is available at https://github.com/Muhammad-Al-Barbary/CoMoTo .
IVSep 13, 2024
Tri-Plane Mamba: Efficiently Adapting Segment Anything Model for 3D Medical ImagesHualiang Wang, Yiqun Lin, Xinpeng Ding et al.
General networks for 3D medical image segmentation have recently undergone extensive exploration. Behind the exceptional performance of these networks lies a significant demand for a large volume of pixel-level annotated data, which is time-consuming and labor-intensive. The emergence of the Segment Anything Model (SAM) has enabled this model to achieve superior performance in 2D medical image segmentation tasks via parameter- and data-efficient feature adaptation. However, the introduction of additional depth channels in 3D medical images not only prevents the sharing of 2D pre-trained features but also results in a quadratic increase in the computational cost for adapting SAM. To overcome these challenges, we present the Tri-Plane Mamba (TP-Mamba) adapters tailored for the SAM, featuring two major innovations: 1) multi-scale 3D convolutional adapters, optimized for efficiently processing local depth-level information, 2) a tri-plane mamba module, engineered to capture long-range depth-level representation without significantly increasing computational costs. This approach achieves state-of-the-art performance in 3D CT organ segmentation tasks. Remarkably, this superior performance is maintained even with scarce training data. Specifically using only three CT training samples from the BTCV dataset, it surpasses conventional 3D segmentation networks, attaining a Dice score that is up to 12% higher.
CVOct 2, 2023
Towards Distribution-Agnostic Generalized Category DiscoveryJianhong Bai, Zuozhu Liu, Hualiang Wang et al.
Data imbalance and open-ended distribution are two intrinsic characteristics of the real visual world. Though encouraging progress has been made in tackling each challenge separately, few works dedicated to combining them towards real-world scenarios. While several previous works have focused on classifying close-set samples and detecting open-set samples during testing, it's still essential to be able to classify unknown subjects as human beings. In this paper, we formally define a more realistic task as distribution-agnostic generalized category discovery (DA-GCD): generating fine-grained predictions for both close- and open-set classes in a long-tailed open-world setting. To tackle the challenging problem, we propose a Self-Balanced Co-Advice contrastive framework (BaCon), which consists of a contrastive-learning branch and a pseudo-labeling branch, working collaboratively to provide interactive supervision to resolve the DA-GCD task. In particular, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides contrastive learning through self-balanced knowledge transfer and a proposed novel contrastive loss. We compare BaCon with state-of-the-art methods from two closely related fields: imbalanced semi-supervised learning and generalized category discovery. The effectiveness of BaCon is demonstrated with superior performance over all baselines and comprehensive analysis across various datasets. Our code is publicly available.
CVJul 22, 2023
Morphology-inspired Unsupervised Gland Segmentation via Selective Semantic GroupingQixiang Zhang, Yi Li, Cheng Xue et al.
Designing deep learning algorithms for gland segmentation is crucial for automatic cancer diagnosis and prognosis, yet the expensive annotation cost hinders the development and application of this technology. In this paper, we make a first attempt to explore a deep learning method for unsupervised gland segmentation, where no manual annotations are required. Existing unsupervised semantic segmentation methods encounter a huge challenge on gland images: They either over-segment a gland into many fractions or under-segment the gland regions by confusing many of them with the background. To overcome this challenge, our key insight is to introduce an empirical cue about gland morphology as extra knowledge to guide the segmentation process. To this end, we propose a novel Morphology-inspired method via Selective Semantic Grouping. We first leverage the empirical cue to selectively mine out proposals for gland sub-regions with variant appearances. Then, a Morphology-aware Semantic Grouping module is employed to summarize the overall information about the gland by explicitly grouping the semantics of its sub-region proposals. In this way, the final segmentation network could learn comprehensive knowledge about glands and produce well-delineated, complete predictions. We conduct experiments on GlaS dataset and CRAG dataset. Our method exceeds the second-best counterpart over 10.56% at mIOU.
74.4HCMar 17
Small Talk, Big Impact? LLM-based Conversational Agents to Mitigate Passive Fatigue in Conditional Automated DrivingLewis Cockram, Yueteng Yu, Jorge Pardo et al.
Passive fatigue during conditional automated driving can compromise driver readiness and safety. This paper presents findings from a test-track study with 40 participants in a real-world automated driving scenario. In this scenario, a Large Language Model (LLM) based conversational agent (CA) was designed to check in with drivers and re-engage them with their surroundings. Drawing on in-car video recordings, sleepiness ratings and interviews, we analysed how drivers interacted with the agent and how these interactions shaped alertness. Results show the CA is helpful for supporting vigilance during passive fatigue. Thematic analysis of acceptability further revealed three user preference profiles that implicate future intention to use CAs. Positioning empirically observed profiles within existing CA archetype frameworks highlights the need for adaptive design sensitive to diverse user groups. This work underscores the potential of CAs as proactive Human-Machine Interface (HMI) interventions, demonstrating how natural language can support context-aware interaction during automated driving.
CVMar 22, 2023
Reliable and Efficient Evaluation of Adversarial Robustness for Deep Hashing-Based RetrievalXunguang Wang, Jiawang Bai, Xinyue Xu et al.
Deep hashing has been extensively applied to massive image retrieval due to its efficiency and effectiveness. Recently, several adversarial attacks have been presented to reveal the vulnerability of deep hashing models against adversarial examples. However, existing attack methods suffer from degraded performance or inefficiency because they underutilize the semantic relations between original samples or spend a lot of time learning these relations with a deep neural network. In this paper, we propose a novel Pharos-guided Attack, dubbed PgA, to evaluate the adversarial robustness of deep hashing networks reliably and efficiently. Specifically, we design pharos code to represent the semantics of the benign image, which preserves the similarity to semantically relevant samples and dissimilarity to irrelevant ones. It is proven that we can quickly calculate the pharos code via a simple math formula. Accordingly, PgA can directly conduct a reliable and efficient attack on deep hashing-based retrieval by maximizing the similarity between the hash code of the adversarial example and the pharos code. Extensive experiments on the benchmark datasets verify that the proposed algorithm outperforms the prior state-of-the-arts in both attack strength and speed.
CVOct 31, 2025Code
ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual DecodingHaonan Wang, Jingyu Lu, Hongrui Li et al.
Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: https://github.com/xmed-lab/ZEBRA.
92.1CRApr 17
Into the Gray Zone: Domain Contexts Can Blur LLM Safety BoundariesKi Sen Hung, Xi Yang, Chang Liu et al.
A central goal of LLM alignment is to balance helpfulness with harmlessness, yet these objectives conflict when the same knowledge serves both legitimate and malicious purposes. This tension is amplified by context-sensitive alignment: we observe that domain-specific contexts (e.g., chemistry) selectively relax defenses for domain-relevant harmful knowledge, while safety-research contexts (e.g., jailbreak studies) trigger broader relaxation spanning all harm categories. To systematically exploit this vulnerability, we propose Jargon, a framework combining safety-research contexts with multi-turn adversarial interactions that achieves attack success rates exceeding 93% across seven frontier models, including GPT-5.2, Claude-4.5, and Gemini-3, substantially outperforming existing methods. Activation space analysis reveals that Jargon queries occupy an intermediate region between benign and harmful inputs, a gray zone where refusal decisions become unreliable. To mitigate this vulnerability, we design a policy-guided safeguard that steers models toward helpful yet harmless responses, and internalize this capability through alignment fine-tuning, reducing attack success rates while preserving helpfulness.
LGOct 4, 2022
TripleE: Easy Domain Generalization via Episodic ReplayXiaomeng Li, Hongyu Ren, Huifeng Yao et al.
Learning how to generalize the model to unseen domains is an important area of research. In this paper, we propose TripleE, and the main idea is to encourage the network to focus on training on subsets (learning with replay) and enlarge the data space in learning on subsets. Learning with replay contains two core designs, EReplayB and EReplayD, which conduct the replay schema on batch and dataset, respectively. Through this, the network can focus on learning with subsets instead of visiting the global set at a glance, enlarging the model diversity in ensembling. To enlarge the data space in learning on subsets, we verify that an exhaustive and singular augmentation (ESAug) performs surprisingly well on expanding the data space in subsets during replays. Our model dubbed TripleE is frustratingly easy, based on simple augmentation and ensembling. Without bells and whistles, our TripleE method surpasses prior arts on six domain generalization benchmarks, showing that this approach could serve as a stepping stone for future research in domain generalization.
IVJul 1, 2024
Learning 3D Gaussians for Extremely Sparse-View Cone-Beam CT ReconstructionYiqun Lin, Hualiang Wang, Jixiang Chen et al.
Cone-Beam Computed Tomography (CBCT) is an indispensable technique in medical imaging, yet the associated radiation exposure raises concerns in clinical practice. To mitigate these risks, sparse-view reconstruction has emerged as an essential research direction, aiming to reduce the radiation dose by utilizing fewer projections for CT reconstruction. Although implicit neural representations have been introduced for sparse-view CBCT reconstruction, existing methods primarily focus on local 2D features queried from sparse projections, which is insufficient to process the more complicated anatomical structures, such as the chest. To this end, we propose a novel reconstruction framework, namely DIF-Gaussian, which leverages 3D Gaussians to represent the feature distribution in the 3D space, offering additional 3D spatial information to facilitate the estimation of attenuation coefficients. Furthermore, we incorporate test-time optimization during inference to further improve the generalization capability of the model. We evaluate DIF-Gaussian on two public datasets, showing significantly superior reconstruction performance than previous state-of-the-art methods.
47.6DCMay 7
EdgeServing: Deadline-Aware Multi-DNN Serving at the EdgeJiahe Cao, Xiaomeng Li, Qiang Liu et al.
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency, enabled by early-exit mechanism, which expands the scheduling action space under tight latency constraints.