CVMay 17, 2022Code
Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical ImagingRui Yan, Liangqiong Qu, Qingyue Wei et al.
The collection and curation of large-scale medical datasets from multiple institutions is essential for training accurate deep learning models, but privacy concerns often hinder data sharing. Federated learning (FL) is a promising solution that enables privacy-preserving collaborative learning among different institutions, but it generally suffers from performance deterioration due to heterogeneous data distributions and a lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Our method introduces a novel Transformer-based self-supervised pre-training paradigm that pre-trains models directly on decentralized target task datasets using masked image modeling, to facilitate more robust representation learning on heterogeneous data and effective knowledge transfer to downstream models. Extensive empirical results on simulated and real-world medical imaging non-IID federated datasets show that masked image modeling with Transformers significantly improves the robustness of models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared to the supervised baseline with ImageNet pre-training. In addition, we show that our federated self-supervised pre-training methods yield models that generalize better to out-of-distribution data and perform more effectively when fine-tuning with limited labeled data, compared to existing FL algorithms. The code is available at https://github.com/rui-yan/SSL-FL.
CVJul 21, 2023Code
Consistency-guided Meta-Learning for Bootstrapping Semi-Supervised Medical Image SegmentationQingyue Wei, Lequan Yu, Xianhang Li et al.
Medical imaging has witnessed remarkable progress but usually requires a large amount of high-quality annotated data which is time-consuming and costly to obtain. To alleviate this burden, semi-supervised learning has garnered attention as a potential solution. In this paper, we present Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg), a novel method for tackling the challenge of semi-supervised medical image segmentation. Specifically, our approach first involves training a segmentation model on a small set of clean labeled images to generate initial labels for unlabeled data. To further optimize this bootstrapping process, we introduce a per-pixel weight mapping system that dynamically assigns weights to both the initialized labels and the model's own predictions. These weights are determined using a meta-process that prioritizes pixels with loss gradient directions closer to those of clean data, which is based on a small set of precisely annotated images. To facilitate the meta-learning process, we additionally introduce a consistency-based Pseudo Label Enhancement (PLE) scheme that improves the quality of the model's own predictions by ensembling predictions from various augmented versions of the same input. In order to improve the quality of the weight maps obtained through multiple augmentations of a single input, we introduce a mean teacher into the PLE scheme. This method helps to reduce noise in the weight maps and stabilize its generation process. Our extensive experimental results on public atrial and prostate segmentation datasets demonstrate that our proposed method achieves state-of-the-art results under semi-supervision. Our code is available at https://github.com/aijinrjinr/MLB-Seg.
CVOct 11, 2023
3D TransUNet: Advancing Medical Image Segmentation through Vision TransformersJieneng Chen, Jieru Mei, Xianhang Li et al.
Medical image segmentation plays a crucial role in advancing healthcare systems for disease diagnosis and treatment planning. The u-shaped architecture, popularly known as U-Net, has proven highly successful for various medical image segmentation tasks. However, U-Net's convolution-based operations inherently limit its ability to model long-range dependencies effectively. To address these limitations, researchers have turned to Transformers, renowned for their global self-attention mechanisms, as alternative architectures. One popular network is our previous TransUNet, which leverages Transformers' self-attention to complement U-Net's localized information with the global context. In this paper, we extend the 2D TransUNet architecture to a 3D network by building upon the state-of-the-art nnU-Net architecture, and fully exploring Transformers' potential in both the encoder and decoder design. We introduce two key components: 1) A Transformer encoder that tokenizes image patches from a convolution neural network (CNN) feature map, enabling the extraction of global contexts, and 2) A Transformer decoder that adaptively refines candidate regions by utilizing cross-attention between candidate proposals and U-Net features. Our investigations reveal that different medical tasks benefit from distinct architectural designs. The Transformer encoder excels in multi-organ segmentation, where the relationship among organs is crucial. On the other hand, the Transformer decoder proves more beneficial for dealing with small and challenging segmented targets such as tumor segmentation. Extensive experiments showcase the significant potential of integrating a Transformer-based encoder and decoder into the u-shaped medical image segmentation architecture. TransUNet outperforms competitors in various medical applications.
NCOct 27, 2022
Joint Graph Convolution for Analyzing Brain Structural and Functional ConnectomeYueting Li, Qingyue Wei, Ehsan Adeli et al.
The white-matter (micro-)structural architecture of the brain promotes synchrony among neuronal populations, giving rise to richly patterned functional connections. A fundamental problem for systems neuroscience is determining the best way to relate structural and functional networks quantified by diffusion tensor imaging and resting-state functional MRI. As one of the state-of-the-art approaches for network analysis, graph convolutional networks (GCN) have been separately used to analyze functional and structural networks, but have not been applied to explore inter-network relationships. In this work, we propose to couple the two networks of an individual by adding inter-network edges between corresponding brain regions, so that the joint structure-function graph can be directly analyzed by a single GCN. The weights of inter-network edges are learnable, reflecting non-uniform structure-function coupling strength across the brain. We apply our Joint-GCN to predict age and sex of 662 participants from the public dataset of the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA) based on their functional and micro-structural white-matter networks. Our results support that the proposed Joint-GCN outperforms existing multi-modal graph learning approaches for analyzing structural and functional networks.
CVMar 27, 2024Code
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical DecodingZhiheng Cheng, Qingyue Wei, Hongru Zhu et al.
The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient fine-tuning of medical images via a two-stage hierarchical decoding procedure. In the initial stage, H-SAM employs SAM's original decoder to generate a prior probabilistic mask, guiding a more intricate decoding process in the second stage. Specifically, we propose two key designs: 1) A class-balanced, mask-guided self-attention mechanism addressing the unbalanced label distribution, enhancing image embedding; 2) A learnable mask cross-attention mechanism spatially modulating the interplay among different image regions based on the prior mask. Moreover, the inclusion of a hierarchical pixel decoder in H-SAM enhances its proficiency in capturing fine-grained and localized details. This approach enables SAM to effectively integrate learned medical priors, facilitating enhanced adaptation for medical image segmentation with limited samples. Our H-SAM demonstrates a 4.78% improvement in average Dice compared to existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM even outperforms state-of-the-art semi-supervised models relying on extensive unlabeled training data across various medical datasets. Our code is available at https://github.com/Cccccczh404/H-SAM.
CVApr 24, 2024Code
RetinaRegNet: A Zero-Shot Approach for Retinal Image RegistrationVishal Balaji Sivaraman, Muhammad Imran, Qingyue Wei et al.
We introduce RetinaRegNet, a zero-shot image registration model designed to register retinal images with minimal overlap, large deformations, and varying image quality. RetinaRegNet addresses these challenges and achieves robust and accurate registration through the following steps. First, we extract features from the moving and fixed images using latent diffusion models. We then sample feature points from the fixed image using a combination of the SIFT algorithm and random point sampling. For each sampled point, we identify its corresponding point in the moving image using a 2D correlation map, which computes the cosine similarity between the diffusion feature vectors of the point in the fixed image and all pixels in the moving image. Second, we eliminate most incorrectly detected point correspondences (outliers) by enforcing an inverse consistency constraint, ensuring that correspondences are consistent in both forward and backward directions. We further remove outliers with large distances between corresponding points using a global transformation based outlier detector. Finally, we implement a two-stage registration framework to handle large deformations. The first stage estimates a homography transformation to achieve global alignment between the images, while the second stage uses a third-order polynomial transformation to estimate local deformations. We evaluated RetinaRegNet on three retinal image registration datasets: color fundus images, fluorescein angiography images, and laser speckle flowgraphy images. Our model consistently outperformed state-of-the-art methods across all datasets. The accurate registration achieved by RetinaRegNet enables the tracking of eye disease progression, enhances surgical planning, and facilitates the evaluation of treatment efficacy. Our code is publicly available at: https://github.com/mirthAI/RetinaRegNet.
AIMar 24
Cerebra: A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk AssessmentSheng Liu, Long Chen, Zeyun Zhao et al.
Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra's potential for interpretable, robust decision support in clinical care.
CVFeb 9, 2022Code
L2B: Learning to Bootstrap Robust Models for Combating Label NoiseYuyin Zhou, Xianhang Li, Fengze Liu et al.
Deep neural networks have shown great success in representation learning. However, when learning with noisy labels (LNL), they can easily overfit and fail to generalize to new data. This paper introduces a simple and effective method, named Learning to Bootstrap (L2B), which enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels. It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning. Unlike existing instance reweighting methods, the key to our method lies in a new, versatile objective that enables implicit relabeling concurrently, leading to significant improvements without incurring additional costs. L2B offers several benefits over the baseline methods. It yields more robust models that are less susceptible to the impact of noisy labels by guiding the bootstrapping procedure more effectively. It better exploits the valuable information contained in corrupted instances by adapting the weights of both instances and labels. Furthermore, L2B is compatible with existing LNL methods and delivers competitive results spanning natural and medical imaging tasks including classification and segmentation under both synthetic and real-world noise. Extensive experiments demonstrate that our method effectively mitigates the challenges of noisy labels, often necessitating few to no validation samples, and is well generalized to other tasks such as image segmentation. This not only positions it as a robust complement to existing LNL techniques but also underscores its practical applicability. The code and models are available at https://github.com/yuyinzhou/l2b.
CLMay 23, 2025
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning ModelsChengzhi Liu, Zhongxing Xu, Qingyue Wei et al.
Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.
CVDec 14, 2025
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent SpaceChengzhi Liu, Yuzhe Yang, Yue Fan et al.
Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced cross-modal understanding and reasoning by incorporating Chain-of-Thought (CoT) reasoning in the semantic space. Building upon this, recent studies extend the CoT mechanism to the visual modality, enabling models to integrate visual information during reasoning through external tools or explicit image generation. However, these methods remain dependent on explicit step-by-step reasoning, unstable perception-reasoning interaction and notable computational overhead. Inspired by human cognition, we posit that thinking unfolds not linearly but through the dynamic interleaving of reasoning and perception within the mind. Motivated by this perspective, we propose DMLR, a test-time Dynamic Multimodal Latent Reasoning framework that employs confidence-guided latent policy gradient optimization to refine latent think tokens for in-depth reasoning. Furthermore, a Dynamic Visual Injection Strategy is introduced, which retrieves the most relevant visual features at each latent think token and updates the set of best visual patches. The updated patches are then injected into latent think token to achieve dynamic visual-textual interleaving. Experiments across seven multimodal reasoning benchmarks and various model architectures demonstrate that DMLR significantly improves reasoning and perception performance while maintaining high inference efficiency.
NCAug 7, 2025
Revealing Neurocognitive and Behavioral Patterns by Unsupervised Manifold Learning from Dynamic Brain DataZixia Zhou, Junyan Liu, Wei Emma Wu et al.
Dynamic brain data, teeming with biological and functional insights, are becoming increasingly accessible through advanced measurements, providing a gateway to understanding the inner workings of the brain in living subjects. However, the vast size and intricate complexity of the data also pose a daunting challenge in reliably extracting meaningful information across various data sources. This paper introduces a generalizable unsupervised deep manifold learning for exploration of neurocognitive and behavioral patterns. Unlike existing methods that extract patterns directly from the input data as in the existing methods, the proposed Brain-dynamic Convolutional-Network-based Embedding (BCNE) seeks to capture the brain-state trajectories by deciphering the temporospatial correlations within the data and subsequently applying manifold learning to this correlative representation. The performance of BCNE is showcased through the analysis of several important dynamic brain datasets. The results, both visual and quantitative, reveal a diverse array of intriguing and interpretable patterns. BCNE effectively delineates scene transitions, underscores the involvement of different brain regions in memory and narrative processing, distinguishes various stages of dynamic learning processes, and identifies differences between active and passive behaviors. BCNE provides an effective tool for exploring general neuroscience inquiries or individual-specific patterns.