CVJul 23, 2023Code
SwIPE: Efficient and Robust Medical Image Segmentation with Implicit Patch EmbeddingsYejia Zhang, Pengfei Gu, Nishchal Sapkota et al.
Modern medical image segmentation methods primarily use discrete representations in the form of rasterized masks to learn features and generate predictions. Although effective, this paradigm is spatially inflexible, scales poorly to higher-resolution images, and lacks direct understanding of object shapes. To address these limitations, some recent works utilized implicit neural representations (INRs) to learn continuous representations for segmentation. However, these methods often directly adopted components designed for 3D shape reconstruction. More importantly, these formulations were also constrained to either point-based or global contexts, lacking contextual understanding or local fine-grained details, respectively--both critical for accurate segmentation. To remedy this, we propose a novel approach, SwIPE (Segmentation with Implicit Patch Embeddings), that leverages the advantages of INRs and predicts shapes at the patch level--rather than at the point level or image level--to enable both accurate local boundary delineation and global shape coherence. Extensive evaluations on two tasks (2D polyp segmentation and 3D abdominal organ segmentation) show that SwIPE significantly improves over recent implicit approaches and outperforms state-of-the-art discrete methods with over 10x fewer parameters. Our method also demonstrates superior data efficiency and improved robustness to data shifts across image resolutions and datasets. Code is available on Github (https://github.com/charzharr/miccai23-swipe-implicit-segmentation).
CVNov 15, 2022
Unsupervised Feature Clustering Improves Contrastive Representation Learning for Medical Image SegmentationYejia Zhang, Xinrong Hu, Nishchal Sapkota et al.
Self-supervised instance discrimination is an effective contrastive pretext task to learn feature representations and address limited medical image annotations. The idea is to make features of transformed versions of the same images similar while forcing all other augmented images' representations to contrast. However, this instance-based contrastive learning leaves performance on the table by failing to maximize feature affinity between images with similar content while counter-productively pushing their representations apart. Recent improvements on this paradigm (e.g., leveraging multi-modal data, different images in longitudinal studies, spatial correspondences) either relied on additional views or made stringent assumptions about data properties, which can sacrifice generalizability and applicability. To address this challenge, we propose a new self-supervised contrastive learning method that uses unsupervised feature clustering to better select positive and negative image samples. More specifically, we produce pseudo-classes by hierarchically clustering features obtained by an auto-encoder in an unsupervised manner, and prevent destructive interference during contrastive learning by avoiding the selection of negatives from the same pseudo-class. Experiments on 2D skin dermoscopic image segmentation and 3D multi-class whole heart CT segmentation demonstrate that our method outperforms state-of-the-art self-supervised contrastive techniques on these tasks.
CVNov 15, 2022
ConvFormer: Combining CNN and Transformer for Medical Image SegmentationPengfei Gu, Yejia Zhang, Chaoli Wang et al.
Convolutional neural network (CNN) based methods have achieved great successes in medical image segmentation, but their capability to learn global representations is still limited due to using small effective receptive fields of convolution operations. Transformer based methods are capable of modelling long-range dependencies of information for capturing global representations, yet their ability to model local context is lacking. Integrating CNN and Transformer to learn both local and global representations while exploring multi-scale features is instrumental in further improving medical image segmentation. In this paper, we propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation. ConvFormer is based on several simple yet effective designs. (1) A feed forward module of Deformable Transformer (DeTrans) is re-designed to introduce local information, called Enhanced DeTrans. (2) A residual-shaped hybrid stem based on a combination of convolutions and Enhanced DeTrans is developed to capture both local and global representations to enhance representation ability. (3) Our encoder utilizes the residual-shaped hybrid stem in a hierarchical manner to generate feature maps in different scales, and an additional Enhanced DeTrans encoder with residual connections is built to exploit multi-scale features with feature maps of different scales as input. Experiments on several datasets show that our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
LGSep 9, 2023
RR-CP: Reliable-Region-Based Conformal Prediction for Trustworthy Medical Image ClassificationYizhe Zhang, Shuo Wang, Yejia Zhang et al.
Conformal prediction (CP) generates a set of predictions for a given test sample such that the prediction set almost always contains the true label (e.g., 99.5\% of the time). CP provides comprehensive predictions on possible labels of a given test sample, and the size of the set indicates how certain the predictions are (e.g., a set larger than one is `uncertain'). Such distinct properties of CP enable effective collaborations between human experts and medical AI models, allowing efficient intervention and quality check in clinical decision-making. In this paper, we propose a new method called Reliable-Region-Based Conformal Prediction (RR-CP), which aims to impose a stronger statistical guarantee so that the user-specified error rate (e.g., 0.5\%) can be achieved in the test time, and under this constraint, the size of the prediction set is optimized (to be small). We consider a small prediction set size an important measure only when the user-specified error rate is achieved. Experiments on five public datasets show that our RR-CP performs well: with a reasonably small-sized prediction set, it achieves the user-specified error rate (e.g., 0.5\%) significantly more frequently than exiting CP methods.
CVNov 6, 2025Code
When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image SegmentationNishchal Sapkota, Haoyan Shi, Yejia Zhang et al.
Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net like architecture that integrates rational-function based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST
CVNov 16, 2022
Keep Your Friends Close & Enemies Farther: Debiasing Contrastive Learning with Spatial Priors in 3D Radiology ImagesYejia Zhang, Nishchal Sapkota, Pengfei Gu et al.
Understanding of spatial attributes is central to effective 3D radiology image analysis where crop-based learning is the de facto standard. Given an image patch, its core spatial properties (e.g., position & orientation) provide helpful priors on expected object sizes, appearances, and structures through inherent anatomical consistencies. Spatial correspondences, in particular, can effectively gauge semantic similarities between inter-image regions, while their approximate extraction requires no annotations or overbearing computational costs. However, recent 3D contrastive learning approaches either neglect correspondences or fail to maximally capitalize on them. To this end, we propose an extensible 3D contrastive framework (Spade, for Spatial Debiasing) that leverages extracted correspondences to select more effective positive & negative samples for representation learning. Our method learns both globally invariant and locally equivariant representations with downstream segmentation in mind. We also propose separate selection strategies for global & local scopes that tailor to their respective representational requirements. Compared to recent state-of-the-art approaches, Spade shows notable improvements on three downstream segmentation tasks (CT Abdominal Organ, CT Heart, MR Heart).
CVNov 15, 2022
A Point in the Right Direction: Vector Prediction for Spatially-aware Self-supervised Volumetric Representation LearningYejia Zhang, Pengfei Gu, Nishchal Sapkota et al.
High annotation costs and limited labels for dense 3D medical imaging tasks have recently motivated an assortment of 3D self-supervised pretraining methods that improve transfer learning performance. However, these methods commonly lack spatial awareness despite its centrality in enabling effective 3D image analysis. More specifically, position, scale, and orientation are not only informative but also automatically available when generating image crops for training. Yet, to date, no work has proposed a pretext task that distills all key spatial features. To fulfill this need, we develop a new self-supervised method, VectorPOSE, which promotes better spatial understanding with two novel pretext tasks: Vector Prediction (VP) and Boundary-Focused Reconstruction (BFR). VP focuses on global spatial concepts (i.e., properties of 3D patches) while BFR addresses weaknesses of recent reconstruction methods to learn more effective local representations. We evaluate VectorPOSE on three 3D medical image segmentation tasks, showing that it often outperforms state-of-the-art methods, especially in limited annotation settings.
CVFeb 6, 2024
SHMC-Net: A Mask-guided Feature Fusion Network for Sperm Head Morphology ClassificationNishchal Sapkota, Yejia Zhang, Sirui Li et al.
Male infertility accounts for about one-third of global infertility cases. Manual assessment of sperm abnormalities through head morphology analysis encounters issues of observer variability and diagnostic discrepancies among experts. Its alternative, Computer-Assisted Semen Analysis (CASA), suffers from low-quality sperm images, small datasets, and noisy class labels. We propose a new approach for sperm head morphology classification, called SHMC-Net, which uses segmentation masks of sperm heads to guide the morphology classification of sperm images. SHMC-Net generates reliable segmentation masks using image priors, refines object boundaries with an efficient graph-based method, and trains an image network with sperm head crops and a mask network with the corresponding masks. In the intermediate stages of the networks, image and mask features are fused with a fusion scheme to better learn morphological features. To handle noisy class labels and regularize training on small datasets, SHMC-Net applies Soft Mixup to combine mixup augmentation and a loss function. We achieve state-of-the-art results on SCIAN and HuSHeM datasets, outperforming methods that use additional pre-training or costly ensembling techniques.
IVFeb 6, 2024
ConUNETR: A Conditional Transformer Network for 3D Micro-CT Embryonic Cartilage SegmentationNishchal Sapkota, Yejia Zhang, Susan M. Motch Perrine et al.
Studying the morphological development of cartilaginous and osseous structures is critical to the early detection of life-threatening skeletal dysmorphology. Embryonic cartilage undergoes rapid structural changes within hours, introducing biological variations and morphological shifts that limit the generalization of deep learning-based segmentation models that infer across multiple embryonic age groups. Obtaining individual models for each age group is expensive and less effective, while direct transfer (predicting an age unseen during training) suffers a potential performance drop due to morphological shifts. We propose a novel Transformer-based segmentation model with improved biological priors that better distills morphologically diverse information through conditional mechanisms. This enables a single model to accurately predict cartilage across multiple age groups. Experiments on the mice cartilage dataset show the superiority of our new model compared to other competitive segmentation models. Additional studies on a separate mice cartilage dataset with a distinct mutation show that our model generalizes well and effectively captures age-based cartilage morphology patterns.
CVAug 3, 2025
TopoImages: Incorporating Local Topology Encoding into Deep Learning Models for Medical Image ClassificationPengfei Gu, Hongxiao Wang, Yejia Zhang et al.
Topological structures in image data, such as connected components and loops, play a crucial role in understanding image content (e.g., biomedical objects). % Despite remarkable successes of numerous image processing methods that rely on appearance information, these methods often lack sensitivity to topological structures when used in general deep learning (DL) frameworks. % In this paper, we introduce a new general approach, called TopoImages (for Topology Images), which computes a new representation of input images by encoding local topology of patches. % In TopoImages, we leverage persistent homology (PH) to encode geometric and topological features inherent in image patches. % Our main objective is to capture topological information in local patches of an input image into a vectorized form. % Specifically, we first compute persistence diagrams (PDs) of the patches, % and then vectorize and arrange these PDs into long vectors for pixels of the patches. % The resulting multi-channel image-form representation is called a TopoImage. % TopoImages offers a new perspective for data analysis. % To garner diverse and significant topological features in image data and ensure a more comprehensive and enriched representation, we further generate multiple TopoImages of the input image using various filtration functions, which we call multi-view TopoImages. % The multi-view TopoImages are fused with the input image for DL-based classification, with considerable improvement. % Our TopoImages approach is highly versatile and can be seamlessly integrated into common DL frameworks. Experiments on three public medical image classification datasets demonstrate noticeably improved accuracy over state-of-the-art methods.
CVOct 10, 2025
Cell Instance Segmentation: The Devil Is in the BoundariesPeixian Liang, Yifan Ding, Yizhe Zhang et al.
State-of-the-art (SOTA) methods for cell instance segmentation are based on deep learning (DL) semantic segmentation approaches, focusing on distinguishing foreground pixels from background pixels. In order to identify cell instances from foreground pixels (e.g., pixel clustering), most methods decompose instance information into pixel-wise objectives, such as distances to foreground-background boundaries (distance maps), heat gradients with the center point as heat source (heat diffusion maps), and distances from the center point to foreground-background boundaries with fixed angles (star-shaped polygons). However, pixel-wise objectives may lose significant geometric properties of the cell instances, such as shape, curvature, and convexity, which require a collection of pixels to represent. To address this challenge, we present a novel pixel clustering method, called Ceb (for Cell boundaries), to leverage cell boundary features and labels to divide foreground pixels into cell instances. Starting with probability maps generated from semantic segmentation, Ceb first extracts potential foreground-foreground boundaries with a revised Watershed algorithm. For each boundary candidate, a boundary feature representation (called boundary signature) is constructed by sampling pixels from the current foreground-foreground boundary as well as the neighboring background-foreground boundaries. Next, a boundary classifier is used to predict its binary boundary label based on the corresponding boundary signature. Finally, cell instances are obtained by dividing or merging neighboring regions based on the predicted boundary labels. Extensive experiments on six datasets demonstrate that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps. Moreover, Ceb achieves highly competitive performance compared to SOTA cell instance segmentation methods.
IVOct 16, 2024
UniCoN: Universal Conditional Networks for Multi-Age Embryonic Cartilage Segmentation with Sparsely Annotated DataNishchal Sapkota, Yejia Zhang, Zihao Zhao et al.
Osteochondrodysplasia, affecting 2-3% of newborns globally, is a group of bone and cartilage disorders that often result in head malformations, contributing to childhood morbidity and reduced quality of life. Current research on this disease using mouse models faces challenges since it involves accurately segmenting the developing cartilage in 3D micro-CT images of embryonic mice. Tackling this segmentation task with deep learning (DL) methods is laborious due to the big burden of manual image annotation, expensive due to the high acquisition costs of 3D micro-CT images, and difficult due to embryonic cartilage's complex and rapidly changing shapes. While DL approaches have been proposed to automate cartilage segmentation, most such models have limited accuracy and generalizability, especially across data from different embryonic age groups. To address these limitations, we propose novel DL methods that can be adopted by any DL architectures -- including CNNs, Transformers, or hybrid models -- which effectively leverage age and spatial information to enhance model performance. Specifically, we propose two new mechanisms, one conditioned on discrete age categories and the other on continuous image crop locations, to enable an accurate representation of cartilage shape changes across ages and local shape details throughout the cranial region. Extensive experiments on multi-age cartilage segmentation datasets show significant and consistent performance improvements when integrating our conditional modules into popular DL segmentation architectures. On average, we achieve a 1.7% Dice score increase with minimal computational overhead and a 7.5% improvement on unseen data. These results highlight the potential of our approach for developing robust, universal models capable of handling diverse datasets with limited annotated data, a key challenge in DL-based medical image analysis.
CVJun 15, 2024
Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image SegmentationPengfei Gu, Huimin Li, Yejia Zhang et al.
Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.
CVFeb 15, 2022
Improving Human Sperm Head Morphology Classification with Unsupervised Anatomical Feature DistillationYejia Zhang, Jingjing Zhang, Xiaomin Zha et al.
With rising male infertility, sperm head morphology classification becomes critical for accurate and timely clinical diagnosis. Recent deep learning (DL) morphology analysis methods achieve promising benchmark results, but leave performance and robustness on the table by relying on limited and possibly noisy class labels. To address this, we introduce a new DL training framework that leverages anatomical and image priors from human sperm microscopy crops to extract useful features without additional labeling cost. Our core idea is to distill sperm head information with reliably-generated pseudo-masks and unsupervised spatial prediction tasks. The predicted foreground masks from this distillation step are then leveraged to regularize and reduce image and label noise in the tuning stage. We evaluate our new approach on two public sperm datasets and achieve state-of-the-art performances (e.g. 65.9% SCIAN accuracy and 96.5% HuSHeM accuracy).
CVFeb 1, 2018
A New Registration Approach for Dynamic Analysis of Calcium Signals in OrgansPeixian Liang, Jianxu Chen, Pavel A. Brodskiy et al.
Wing disc pouches of fruit flies are a powerful genetic model for studying physiological intercellular calcium ($Ca^{2+}$) signals for dynamic analysis of cell signaling in organ development and disease studies. A key to analyzing spatial-temporal patterns of $Ca^{2+}$ signal waves is to accurately align the pouches across image sequences. However, pouches in different image frames may exhibit extensive intensity oscillations due to $Ca^{2+}$ signaling dynamics, and commonly used multimodal non-rigid registration methods may fail to achieve satisfactory results. In this paper, we develop a new two-phase non-rigid registration approach to register pouches in image sequences. First, we conduct segmentation of the region of interest. (i.e., pouches) using a deep neural network model. Second, we obtain an optimal transformation and align pouches across the image sequences. Evaluated using both synthetic data and real pouch data, our method considerably outperforms the state-of-the-art non-rigid registration methods.