CV · Dec 16, 2022
Biomedical image analysis competitions: The state of current participation practice
Matthias Eisenmann, Annika Reinke, Vivienn Weru et al. · utoronto
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
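The K-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the survey or from any participant's code; the function name `k_fold_splits` and its parameters are hypothetical, and only the Python standard library is assumed.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    # Spread samples as evenly as possible: the first (n % k) folds get one extra.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Toy usage: 10 training cases split into 3 folds.
splits = list(k_fold_splits(10, k=3))
```

Each case appears in exactly one validation fold, so every training case contributes once to the validation estimate. Models trained on the individual folds could also serve as the "multiple identical models" of an ensemble, although the abstract does not say how participants built theirs.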
98.2 · AI · Jun 4
Towards World Models in Biomedical Research
Guangyu Wang, Jingkun Yue, Siqi Zhang et al.
A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.
CV · Jul 7, 2023
Non-iterative Coarse-to-fine Transformer Networks for Joint Affine and Deformable Image Registration
Mingyuan Meng, Lei Bi, Michael Fulham et al.
Image registration is a fundamental requirement for medical image analysis. Deep registration methods based on deep learning have been widely recognized for their capabilities to perform fast end-to-end registration. Many deep registration methods achieved state-of-the-art performance by performing coarse-to-fine registration, where multiple registration steps were iterated with cascaded networks. Recently, Non-Iterative Coarse-to-finE (NICE) registration methods have been proposed to perform coarse-to-fine registration in a single network and showed advantages in both registration accuracy and runtime. However, existing NICE registration methods mainly focus on deformable registration, while affine registration, a common prerequisite, is still reliant on time-consuming traditional optimization-based methods or extra affine registration networks. In addition, existing NICE registration methods are limited by the intrinsic locality of convolution operations. Transformers may address this limitation for their capabilities to capture long-range dependency, but the benefits of using transformers for NICE registration have not been explored. In this study, we propose a Non-Iterative Coarse-to-finE Transformer network (NICE-Trans) for image registration. Our NICE-Trans is the first deep registration method that (i) performs joint affine and deformable coarse-to-fine registration within a single network, and (ii) embeds transformers into a NICE registration framework to model long-range relevance between images. Extensive experiments with seven public datasets show that our NICE-Trans outperforms state-of-the-art registration methods on both registration accuracy and runtime.
IV · Sep 11, 2023
AutoFuse: Automatic Fusion Networks for Deformable Medical Image Registration
Mingyuan Meng, Michael Fulham, Dagan Feng et al.
Deformable image registration aims to find a dense non-linear spatial correspondence between a pair of images, which is a crucial step for many medical tasks such as tumor growth monitoring and population analysis. Recently, Deep Neural Networks (DNNs) have been widely recognized for their ability to perform fast end-to-end registration. However, DNN-based registration needs to explore the spatial information of each image and fuse this information to characterize spatial correspondence. This raises an essential question: what is the optimal fusion strategy to characterize spatial correspondence? Existing fusion strategies (e.g., early fusion, late fusion) were empirically designed to fuse information by manually defined prior knowledge, which inevitably constrains the registration performance within the limits of empirical designs. In this study, we depart from existing empirically-designed fusion strategies and develop a data-driven fusion strategy for deformable image registration. To achieve this, we propose an Automatic Fusion network (AutoFuse) that provides flexibility to fuse information at many potential locations within the network. A Fusion Gate (FG) module is also proposed to control how to fuse information at each potential network location based on training data. Our AutoFuse can automatically optimize its fusion strategy during training and can be generalizable to both unsupervised registration (without any labels) and semi-supervised registration (with weak labels provided for partial training data). Extensive experiments on two well-benchmarked medical registration tasks (inter- and intra-patient registration) with eight public datasets show that our AutoFuse outperforms state-of-the-art unsupervised and semi-supervised registration methods.
IV · Jul 7, 2023
Merging-Diverging Hybrid Transformer Networks for Survival Prediction in Head and Neck Cancer
Mingyuan Meng, Lei Bi, Michael Fulham et al.
Survival prediction is crucial for cancer patients as it provides early prognostic information for treatment planning. Recently, deep survival models based on deep learning and medical images have shown promising performance for survival prediction. However, existing deep survival models are not well developed in utilizing multi-modality images (e.g., PET-CT) and in extracting region-specific information (e.g., the prognostic information in Primary Tumor (PT) and Metastatic Lymph Node (MLN) regions). In view of this, we propose a merging-diverging learning framework for survival prediction from multi-modality images. This framework has a merging encoder to fuse multi-modality information and a diverging decoder to extract region-specific information. In the merging encoder, we propose a Hybrid Parallel Cross-Attention (HPCA) block to effectively fuse multi-modality features via parallel convolutional layers and cross-attention transformers. In the diverging decoder, we propose a Region-specific Attention Gate (RAG) block to screen out the features related to lesion regions. Our framework is demonstrated on survival prediction from PET-CT images in Head and Neck (H&N) cancer, by designing an X-shape merging-diverging hybrid transformer network (named XSurv). Our XSurv combines the complementary information in PET and CT images and extracts the region-specific prognostic information in PT and MLN regions. Extensive experiments on the public dataset of HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022) demonstrate that our XSurv outperforms state-of-the-art survival prediction methods.
CV · Jun 25, 2022
Non-iterative Coarse-to-fine Registration based on Single-pass Deep Cumulative Learning
Mingyuan Meng, Lei Bi, Dagan Feng et al.
Deformable image registration is a crucial step in medical image analysis for finding a non-linear spatial transformation between a pair of fixed and moving images. Deep registration methods based on Convolutional Neural Networks (CNNs) have been widely used as they can perform image registration in a fast and end-to-end manner. However, these methods usually have limited performance for image pairs with large deformations. Recently, iterative deep registration methods have been used to alleviate this limitation, where the transformations are iteratively learned in a coarse-to-fine manner. However, iterative methods inevitably prolong the registration runtime, and tend to learn separate image features for each iteration, which hinders the features from being leveraged to facilitate the registration at later iterations. In this study, we propose a Non-Iterative Coarse-to-finE registration Network (NICE-Net) for deformable image registration. In the NICE-Net, we propose: (i) a Single-pass Deep Cumulative Learning (SDCL) decoder that can cumulatively learn coarse-to-fine transformations within a single pass (iteration) of the network, and (ii) a Selectively-propagated Feature Learning (SFL) encoder that can learn common image features for the whole coarse-to-fine registration process and selectively propagate the features as needed. Extensive experiments on six public datasets of 3D brain Magnetic Resonance Imaging (MRI) show that our proposed NICE-Net can outperform state-of-the-art iterative deep registration methods while only requiring similar runtime to non-iterative methods.
IV · Nov 10, 2022
Radiomics-enhanced Deep Multi-task Learning for Outcome Prediction in Head and Neck Cancer
Mingyuan Meng, Lei Bi, Dagan Feng et al.
Outcome prediction is crucial for head and neck cancer patients as it can provide prognostic information for early treatment planning. Radiomics methods have been widely used for outcome prediction from medical images. However, these methods are limited by their reliance on intractable manual segmentation of tumor regions. Recently, deep learning methods have been proposed to perform end-to-end outcome prediction so as to remove the reliance on manual segmentation. Unfortunately, without segmentation masks, these methods take the whole image as input, which makes it difficult for them to focus on tumor regions and potentially leaves them unable to fully leverage the prognostic information within the tumor regions. In this study, we propose a radiomics-enhanced deep multi-task framework for outcome prediction from PET/CT images, in the context of the HEad and neCK TumOR segmentation and outcome prediction challenge (HECKTOR 2022). The novelty of our framework is to incorporate radiomics as an enhancement to our recently proposed Deep Multi-task Survival model (DeepMTS). The DeepMTS jointly learns to predict the survival risk scores of patients and the segmentation masks of tumor regions. Radiomics features are extracted from the predicted tumor regions and combined with the predicted survival risk scores for final outcome prediction, through which the prognostic information in tumor regions can be further leveraged. Our method achieved a C-index of 0.681 on the testing set, placing 2nd on the leaderboard with a C-index only 0.00068 lower than that of the 1st place.
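The C-index reported above measures how well predicted risk scores rank patients by time to event. A minimal sketch of Harrell's concordance index follows. It is an illustration only: the exact variant and tie handling used by the HECKTOR 2022 evaluation are not stated in the abstract, so the conventions here (ties in risk count 0.5, a pair is comparable only if the earlier time is an observed event) are assumptions, and the function name is hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i had an observed event strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # assumed convention for tied risks
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort: the third patient is censored at t=6.
c_toy = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1])
c_perfect = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [4, 3, 2, 1])
```

In the toy cohort, five pairs are comparable and four are ranked correctly, so `c_toy` is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.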
CV · Nov 15, 2022
Brain Tumor Sequence Registration with Non-iterative Coarse-to-fine Networks and Dual Deep Supervision
Mingyuan Meng, Lei Bi, Dagan Feng et al.
In this study, we focus on brain tumor sequence registration between pre-operative and follow-up Magnetic Resonance Imaging (MRI) scans of brain glioma patients, in the context of the Brain Tumor Sequence Registration challenge (BraTS-Reg 2022). Brain tumor registration is a fundamental requirement in brain image analysis for quantifying tumor changes. This is a challenging task due to large deformations and missing correspondences between pre-operative and follow-up scans. For this task, we adopt our recently proposed Non-Iterative Coarse-to-finE registration Networks (NICE-Net) - a deep learning-based method for coarse-to-fine registration of images with large deformations. To overcome missing correspondences, we extend the NICE-Net by introducing dual deep supervision, where a deep self-supervised loss based on image similarity and a deep weakly-supervised loss based on manually annotated landmarks are deeply embedded into the NICE-Net. At BraTS-Reg 2022, our method achieved a competitive result on the validation set (mean absolute error: 3.387) and placed 4th in the final testing phase (Score: 0.3544).
IV · Nov 28, 2023
Full-resolution MLPs Empower Medical Dense Prediction
Mingyuan Meng, Yuxin Xue, Dagan Feng et al.
Dense prediction is a fundamental requirement for many medical vision tasks such as medical image restoration, registration, and segmentation. The most popular vision model, Convolutional Neural Networks (CNNs), has reached bottlenecks due to the intrinsic locality of convolution operations. Recently, transformers have been widely adopted for dense prediction for their capability to capture long-range visual dependence. However, due to the high computational complexity and large memory consumption of self-attention operations, transformers are usually used at downsampled feature resolutions. Such usage cannot effectively leverage the tissue-level textural information available only at the full image resolution. This textural information is crucial for medical dense prediction as it can differentiate the subtle human anatomy in medical images. In this study, we hypothesize that Multi-layer Perceptrons (MLPs) are superior alternatives to transformers in medical dense prediction where tissue-level details dominate the performance, as MLPs enable long-range dependence at the full image resolution. To validate our hypothesis, we develop a full-resolution hierarchical MLP framework that uses MLPs beginning from the full image resolution. We evaluate this framework with various MLP blocks on a wide range of medical dense prediction tasks including restoration, registration, and segmentation. Extensive experiments on six public well-benchmarked datasets show that, by simply using MLPs at full resolution, our framework outperforms its CNN and transformer counterparts and achieves state-of-the-art performance on various medical dense prediction tasks.
CV · Sep 7, 2024
SGSeg: Enabling Text-free Inference in Language-guided Segmentation of Chest X-rays via Self-guidance
Shuchang Ye, Mingyuan Meng, Mingjian Li et al.
Segmentation of infected areas in chest X-rays is pivotal for facilitating the accurate delineation of pulmonary structures and pathological anomalies. Recently, multi-modal language-guided image segmentation methods have emerged as a promising solution for chest X-rays where the clinical text reports, depicting the assessment of the images, are used as guidance. Nevertheless, existing language-guided methods require clinical reports alongside the images, and hence, they are not applicable to image segmentation in a decision support context, but are rather limited to retrospective image analysis after clinical reporting has been completed. In this study, we propose a self-guided segmentation framework (SGSeg) that leverages language guidance for training (multi-modal) while enabling text-free inference (uni-modal), making it the first framework to enable text-free inference in language-guided segmentation. We exploit the critical location information of both pulmonary and pathological structures depicted in the text reports and introduce a novel localization-enhanced report generation (LERG) module to generate clinical reports for self-guidance. Our LERG integrates an object detector and a location-based attention aggregator, weakly-supervised by a location-aware pseudo-label extraction module. Extensive experiments on the well-benchmarked QaTa-COV19 dataset demonstrate that our SGSeg outperformed existing uni-modal segmentation methods and closely matched the state-of-the-art performance of multi-modal language-guided segmentation methods.
IV · Aug 2, 2024
3DPX: Progressive 2D-to-3D Oral Image Reconstruction with Hybrid MLP-CNN Networks
Xiaoshuang Li, Mingyuan Meng, Zimo Huang et al.
Panoramic X-ray (PX) is a prevalent modality in dental practice for its wide availability and low cost. However, as a 2D projection image, PX does not contain 3D anatomical information, and therefore has limited use in dental applications that can benefit from 3D information, e.g., tooth angular misalignment detection and classification. Reconstructing 3D structures directly from 2D PX has recently been explored to address limitations with existing methods primarily reliant on Convolutional Neural Networks (CNNs) for direct 2D-to-3D mapping. These methods, however, are unable to correctly infer depth-axis spatial information. In addition, they are limited by the intrinsic locality of convolution operations, as the convolution kernels only capture the information of immediate neighborhood pixels. In this study, we propose a progressive hybrid Multilayer Perceptron (MLP)-CNN pyramid network (3DPX) for 2D-to-3D oral PX reconstruction. We introduce a progressive reconstruction strategy, where 3D images are progressively reconstructed in the 3DPX with guidance imposed on the intermediate reconstruction result at each pyramid level. Further, motivated by the recent advancement of MLPs that show promise in capturing fine-grained long-range dependency, our 3DPX integrates MLPs and CNNs to improve the semantic understanding during reconstruction. Extensive experiments on two large datasets involving 464 studies demonstrate that our 3DPX outperforms state-of-the-art 2D-to-3D oral reconstruction methods, including standalone MLP and transformers, in reconstruction quality, and also improves the performance of downstream angular misalignment classification tasks.
IV · Sep 27, 2024
3DPX: Single Panoramic X-ray Analysis Guided by 3D Oral Structure Reconstruction
Xiaoshuang Li, Zimo Huang, Mingyuan Meng et al.
Panoramic X-ray (PX) is a prevalent modality in dentistry practice owing to its wide availability and low cost. However, as a 2D projection of a 3D structure, PX suffers from anatomical information loss and PX diagnosis is limited compared to that with 3D imaging modalities. 2D-to-3D reconstruction methods have been explored for the ability to synthesize the absent 3D anatomical information from 2D PX for use in PX image analysis. However, there are challenges in leveraging such 3D synthesized reconstructions. First, inferring 3D depth from 2D images remains a challenging task with limited accuracy. The second challenge is the joint analysis of 2D PX with its 3D synthesized counterpart, with the aim to maximize the 2D-3D synergy while minimizing the errors arising from the synthesized image. In this study, we propose a new method termed 3DPX - PX image analysis guided by 2D-to-3D reconstruction, to overcome these challenges. 3DPX consists of (i) a novel progressive reconstruction network to improve 2D-to-3D reconstruction and, (ii) a contrastive-guided bidirectional multimodality alignment module for 3D-guided 2D PX classification and segmentation tasks. The reconstruction network progressively reconstructs 3D images with knowledge imposed on the intermediate reconstructions at multiple pyramid levels and incorporates Multilayer Perceptrons to improve semantic understanding. The downstream networks leverage the reconstructed images as 3D anatomical guidance to the PX analysis through feature alignment, which increases the 2D-3D synergy with bidirectional feature projection and decreases the impact of potential errors with contrastive guidance. Extensive experiments on two oral datasets involving 464 studies demonstrate that 3DPX outperforms the state-of-the-art methods in various tasks including 2D-to-3D reconstruction, PX classification and lesion segmentation.
53.3 · CV · May 17
RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection
Shuchang Ye, Mingyuan Meng, Hao Wang et al.
Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.
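The idea of defining anatomy in 3D and projecting it to 2D can be sketched with a toy example. This is a simplified illustration, not the dataset's pipeline: it assumes a parallel (orthographic) projection along the depth axis, whereas the abstract refers to "canonical radiographic geometry", whose details are not given here. The function name and the toy masks are hypothetical.

```python
def project_mask(volume_mask):
    """Project a binary 3D mask indexed [depth][row][col] onto [row][col].

    A 2D pixel is foreground if any voxel along its depth ray is foreground
    (parallel-ray assumption).
    """
    depth = len(volume_mask)
    rows = len(volume_mask[0])
    cols = len(volume_mask[0][0])
    return [
        [int(any(volume_mask[d][r][c] for d in range(depth))) for c in range(cols)]
        for r in range(rows)
    ]

# Two toy structures at different depths (2 slices of 2 x 3 voxels each).
heart = [[[0, 0, 0], [0, 1, 1]], [[0, 0, 0], [0, 0, 0]]]
spine = [[[0, 0, 0], [0, 0, 0]], [[0, 1, 0], [0, 1, 0]]]
heart_2d = project_mask(heart)
spine_2d = project_mask(spine)
```

The two toy structures share no voxel in 3D, yet their projected masks overlap at pixel (1, 1). This is the situation the abstract describes, where structures that overlap in the radiograph remain separable in volumetric space.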
CV · Jun 1, 2020 · Code
High-quality Panorama Stitching based on Asymmetric Bidirectional Optical Flow
Mingyuan Meng, Shaojun Liu
In this paper, we propose a panorama stitching algorithm based on asymmetric bidirectional optical flow. This algorithm expects multiple photos captured by fisheye lens cameras as input, and then, through the proposed algorithm, these photos can be merged into a high-quality 360-degree spherical panoramic image. For photos taken from a distant perspective, the parallax among them is relatively small, and the obtained panoramic image can be nearly seamless and undistorted. For photos taken from a close perspective or with a relatively large parallax, a seamless though partially distorted panoramic image can also be obtained. Besides, with the help of Graphics Processing Unit (GPU), this algorithm can complete the whole stitching process at a very fast speed: typically, it only takes less than 30s to obtain a panoramic image of 9000-by-4000 pixels, which means our panorama stitching algorithm is of high value in many real-time applications. Our code is available at https://github.com/MungoMeng/Panorama-OpticalFlow.
CV · May 22, 2025
MedCFVQA: A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering
Shuchang Ye, Usman Naseem, Mingyuan Meng et al.
Medical Visual Question Answering (MedVQA) is crucial for enhancing the efficiency of clinical diagnosis by providing accurate and timely responses to clinicians' inquiries regarding medical images. Existing MedVQA models suffer from modality preference bias, where predictions are heavily dominated by one modality while overlooking the other (in MedVQA, usually questions dominate the answer but images are overlooked), thereby failing to learn multimodal knowledge. To overcome the modality preference bias, we propose a Medical CounterFactual VQA (MedCFVQA) model, which trains with bias and leverages causal graphs to eliminate the modality preference bias during inference. Existing MedVQA datasets exhibit substantial prior dependencies between questions and answers, which results in acceptable performance even if the model significantly suffers from the modality preference bias. To address this issue, we reconstructed new datasets by leveraging existing MedVQA datasets and Changed their Prior dependencies (CP) between questions and their answers in the training and test sets. Extensive experiments demonstrate that MedCFVQA significantly outperforms its non-causal counterpart on the SLAKE and RadVQA datasets as well as on their SLAKE-CP and RadVQA-CP counterparts.
CV · Dec 18, 2024
Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments
Mingjian Li, Mingyuan Meng, Shuchang Ye et al.
Medical image segmentation is crucial in modern medical image analysis, as it can aid in the diagnosis of various disease conditions. Recently, language-guided segmentation methods have shown promising results in automating image segmentation where text reports are incorporated as guidance. These text reports, containing image impressions and insights given by clinicians, provide auxiliary guidance. However, these methods neglect the inherent pattern gaps between the two distinct modalities, which leads to sub-optimal image-text feature fusion without proper cross-modality feature alignments. Contrastive alignments are widely used to associate image-text semantics in representation learning; however, they have not been exploited to bridge the pattern gaps in language-guided segmentation that relies on subtle low-level image details to represent diseases. Existing contrastive alignment methods typically align high-level global image semantics without involving low-level, localized target information, and therefore fail to explore fine-grained text guidance for language-guided segmentation. In this study, we propose a language-guided segmentation network with Target-informed Multi-level Contrastive Alignments (TMCA). TMCA enables target-informed cross-modality alignments and fine-grained text guidance to bridge the pattern gaps in language-guided segmentation. Specifically, we introduce: 1) a target-sensitive semantic distance module that enables granular image-text alignment modelling, and 2) a multi-level alignment strategy that directs text guidance on low-level image features. In addition, a language-guided target enhancement module is proposed to leverage the aligned text to redirect attention to focus on critical localized image features. Extensive experiments on 4 image-text datasets, involving 3 medical imaging modalities, demonstrated that our TMCA achieved superior performance.
CV · Sep 14, 2025
Toward Next-generation Medical Vision Backbones: Modeling Finer-grained Long-range Visual Dependency
Mingyuan Meng
Medical Image Computing (MIC) is a broad research topic covering both pixel-wise (e.g., segmentation, registration) and image-wise (e.g., classification, regression) vision tasks. Effective analysis demands models that capture both global long-range context and local subtle visual characteristics, necessitating fine-grained long-range visual dependency modeling. Compared to Convolutional Neural Networks (CNNs) that are limited by intrinsic locality, transformers excel at long-range modeling; however, due to the high computational loads of self-attention, transformers typically cannot process high-resolution features (e.g., full-scale image features before downsampling or patch embedding) and thus face difficulties in modeling fine-grained dependency among subtle medical image details. Concurrently, Multi-layer Perceptron (MLP)-based visual models are recognized as computation/memory-efficient alternatives in modeling long-range visual dependency but have yet to be widely investigated in the MIC community. This doctoral research advances deep learning-based MIC by investigating effective long-range visual dependency modeling. It first presents innovative use of transformers for both pixel- and image-wise medical vision tasks. The focus then shifts to MLPs, pioneeringly developing MLP-based visual models to capture fine-grained long-range visual dependency in medical images. Extensive experiments confirm the critical role of long-range dependency modeling in MIC and reveal a key finding: MLPs provide feasibility in modeling finer-grained long-range dependency among higher-resolution medical features containing enriched anatomical/pathological details. This finding establishes MLPs as a superior paradigm over transformers/CNNs, consistently enhancing performance across various medical vision tasks and paving the way for next-generation medical vision backbones.
CV · Jul 15, 2025
Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation
Shuchang Ye, Usman Naseem, Mingyuan Meng et al.
Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as "textual reliance", presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
IV · Dec 24, 2024
Advancing Deformable Medical Image Registration with Multi-axis Cross-covariance Attention
Mingyuan Meng, Michael Fulham, Lei Bi et al.
Deformable image registration is a fundamental requirement for medical image analysis. Recently, transformers have been widely used in deep learning-based registration methods for their ability to capture long-range dependency via self-attention (SA). However, the high computation and memory loads of SA (growing quadratically with the spatial resolution) hinder transformers from processing subtle textural information in high-resolution image features, e.g., at the full and half image resolutions. This limits deformable registration as the high-resolution textural information is crucial for finding precise pixel-wise correspondence between subtle anatomical structures. Cross-covariance Attention (XCA), as a "transposed" version of SA that operates across feature channels, has complexity growing linearly with the spatial resolution, providing the feasibility of capturing long-range dependency among high-resolution image features. However, existing XCA-based transformers merely capture coarse global long-range dependency, which is unsuitable for deformable image registration relying primarily on fine-grained local correspondence. In this study, we propose to improve existing deep learning-based registration methods by embedding a new XCA mechanism. To this end, we design an XCA-based transformer block optimized for deformable medical image registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general network block that can be embedded into various registration network architectures. It can capture both global and local long-range dependency among high-resolution image features by applying regional and dilated XCA in parallel via a multi-axis design. Extensive experiments on two well-benchmarked inter-/intra-patient registration tasks with seven public medical datasets demonstrate that our MAXCA block enables state-of-the-art registration performance.
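The quadratic-versus-linear scaling described above can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate, not a measurement of MAXCA or any published model: it counts only the multiply-accumulates of the attention products, ignores the linear projections and the softmax, and uses hypothetical function names. Here `n_tokens` stands for the number of spatial positions and `dim` for the channel count.

```python
def sa_macs(n_tokens, dim):
    """Approximate MACs of self-attention: Q K^T (N x N x d) plus the
    attention-weighted sum of V (N x N x d)."""
    return 2 * n_tokens * n_tokens * dim

def xca_macs(n_tokens, dim, heads=1):
    """Approximate MACs of cross-covariance attention: per head, a
    (d/h x d/h) channel attention map is built over N tokens, then applied to V."""
    head_dim = dim // heads
    return 2 * heads * head_dim * head_dim * n_tokens

# Doubling the number of spatial positions at a fixed channel count.
n, d = 4096, 64
sa_growth = sa_macs(2 * n, d) / sa_macs(n, d)
xca_growth = xca_macs(2 * n, d, heads=4) / xca_macs(n, d, heads=4)
```

Under these assumptions `sa_growth` is 4 and `xca_growth` is 2, which matches the quadratic and linear growth stated in the abstract. For a 3D volume, halving the voxel spacing along every axis multiplies the token count by 8, so the gap widens quickly at full resolution.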
CV · Jan 24, 2024
Dynamic Traceback Learning for Medical Report Generation
Shuchang Ye, Mingyuan Meng, Mingjian Li et al.
Automated medical report generation has demonstrated the potential to significantly reduce the workload associated with time-consuming medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multimodal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
CV · Jan 19, 2024
Enhancing medical vision-language contrastive learning via inter-matching relation modelling
Mingjian Li, Mingyuan Meng, Michael Fulham et al.
Medical image representations can be learned through medical vision-language contrastive learning (mVLCL) where medical imaging reports are used as weak supervision through image-text alignment. These learned image representations can be transferred to and benefit various downstream medical vision tasks such as disease classification and segmentation. Recent mVLCL methods attempt to align image sub-regions and the report keywords as local-matchings. However, these methods aggregate all local-matchings via simple pooling operations while ignoring the inherent relations between them. These methods therefore fail to reason between local-matchings that are semantically related, e.g., local-matchings that correspond to the disease word and the location word (semantic-relations), and also fail to differentiate such clinically important local-matchings from others that correspond to less meaningful words, e.g., conjunction words (importance-relations). Hence, we propose an mVLCL method that models the inter-matching relations between local-matchings via a relation-enhanced contrastive learning framework (RECLF). In RECLF, we introduce a semantic-relation reasoning module (SRM) and an importance-relation reasoning module (IRM) to enable more fine-grained report supervision for image representation learning. We evaluated our method using six public benchmark datasets on four downstream tasks, including segmentation, zero-shot classification, linear classification, and cross-modal retrieval. Our results demonstrated the superiority of our RECLF over the state-of-the-art mVLCL methods with consistent improvements across single-modal and cross-modal tasks. These results suggest that our RECLF, by modelling the inter-matching relations, can learn improved medical image representations with better generalization capabilities.
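As background for the image-text alignment mentioned above, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired global image and report embeddings. It illustrates plain vision-language contrastive learning only; it does not model local-matchings, the SRM, or the IRM, and the function name and the temperature value are assumptions.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Generic image-text contrastive loss (hypothetical helper, not RECLF).

    img_emb, txt_emb: arrays of shape (B, D); row i of each is a matched pair.
    """
    # Cosine similarity via L2-normalised embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Treat the matched pair (the diagonal) as the correct class per row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_aligned = symmetric_contrastive_loss(emb, emb)
loss_shuffled = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

Correctly paired embeddings give a lower loss than mismatched ones, which is the signal that lets reports act as weak supervision for the image encoder.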
IV · May 17, 2023
AdaMSS: Adaptive Multi-Modality Segmentation-to-Survival Learning for Survival Outcome Prediction from PET/CT Images
Mingyuan Meng, Bingxin Gu, Michael Fulham et al.
Survival prediction is a major concern for cancer management. Deep survival models based on deep learning have been widely adopted to perform end-to-end survival prediction from medical images. Recent deep survival models achieved promising performance by jointly performing tumor segmentation with survival prediction, where the models were guided to extract tumor-related information through Multi-Task Learning (MTL). However, these deep survival models have difficulties in exploring out-of-tumor prognostic information. In addition, existing deep survival models are unable to effectively leverage multi-modality images. Empirically-designed fusion strategies were commonly adopted to fuse multi-modality information via task-specific manually-designed networks, thus limiting the adaptability to different scenarios. In this study, we propose an Adaptive Multi-modality Segmentation-to-Survival model (AdaMSS) for survival prediction from PET/CT images. Instead of adopting MTL, we propose a novel Segmentation-to-Survival Learning (SSL) strategy, where our AdaMSS is trained for tumor segmentation and survival prediction sequentially in two stages. This strategy enables the AdaMSS to focus on tumor regions in the first stage and gradually expand its focus to include other prognosis-related regions in the second stage. We also propose a data-driven strategy to fuse multi-modality information, which realizes adaptive optimization of fusion strategies based on training data during training. With the SSL and data-driven fusion strategies, our AdaMSS is designed as an adaptive model that can self-adapt its focus regions and fusion strategy for different training stages. Extensive experiments with two large clinical datasets show that our AdaMSS outperforms state-of-the-art survival prediction methods.
IV · Dec 13, 2021
The Brain Tumor Sequence Registration (BraTS-Reg) Challenge: Establishing Correspondence Between Pre-Operative and Follow-up MRI Scans of Diffuse Glioma Patients
Bhakti Baheti, Satrajit Chakrabarty, Hamed Akbari et al.
Registration of longitudinal brain MRI scans containing pathologies is challenging due to dramatic changes in tissue appearance. Although there has been progress in developing general-purpose medical image registration techniques, they have not yet attained the requisite precision and reliability for this task, highlighting its inherent complexity. Here we describe the Brain Tumor Sequence Registration (BraTS-Reg) challenge, as the first public benchmark environment for deformable registration algorithms focusing on estimating correspondences between pre-operative and follow-up scans of the same patient diagnosed with a diffuse brain glioma. The BraTS-Reg data comprise de-identified multi-institutional multi-parametric MRI (mpMRI) scans, curated for size and resolution according to a canonical anatomical template, and divided into training, validation, and testing sets. Clinical experts annotated ground truth (GT) landmark points of anatomical locations distinct across the temporal domain. Quantitative evaluation and ranking were based on the Median Euclidean Error (MEE), Robustness, and the determinant of the Jacobian of the displacement field. The top-ranked methodologies yielded similar performance across all evaluation metrics and shared several methodological commonalities, including pre-alignment, deep neural networks, inverse consistency analysis, and test-time instance optimization on a per-case basis as a post-processing step. The top-ranked method attained the MEE at or below that of the inter-rater variability for approximately 60% of the evaluated landmarks, underscoring the scope for further accuracy and robustness improvements, especially relative to human experts. The aim of BraTS-Reg is to continue to serve as an active resource for research, with the data and online evaluation tools accessible at https://bratsreg.github.io/.
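Two of the evaluation quantities named above can be sketched in a few lines of NumPy. This is an illustrative reading, not the challenge's official evaluation code: the function names are hypothetical, landmarks are assumed to be given as (x, y, z) coordinates, and the Jacobian is taken of the mapping x + u(x), which is a common convention when checking a displacement field u for folding. The Robustness metric is not defined in the abstract and is left out.

```python
import numpy as np

def median_euclidean_error(pred_landmarks, gt_landmarks):
    """Median of the per-landmark Euclidean distances (hypothetical helper)."""
    pred = np.asarray(pred_landmarks, dtype=float)
    gt = np.asarray(gt_landmarks, dtype=float)
    return float(np.median(np.linalg.norm(pred - gt, axis=1)))

def jacobian_determinant(disp):
    """det(I + grad u) per voxel for a displacement field u.

    disp: array of shape (D, H, W, 3) in voxel units (assumed layout).
    Values <= 0 indicate local folding of the deformation.
    """
    # grads[i][j] holds d u_i / d x_j, estimated by finite differences.
    grads = [np.gradient(disp[..., i], axis=(0, 1, 2)) for i in range(3)]
    jac = np.empty(disp.shape[:3] + (3, 3))
    for i in range(3):
        for j in range(3):
            jac[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)
    return np.linalg.det(jac)


gt = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
pred = gt + np.array([3.0, 4.0, 0.0])      # every landmark is off by 5 units
mee = median_euclidean_error(pred, gt)

identity_det = jacobian_determinant(np.zeros((4, 4, 4, 3)))
```

A zero displacement field yields a determinant of 1 everywhere, so departures from 1 measure local expansion or compression, and non-positive values flag physically implausible deformations.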
IV · Sep 16, 2021
DeepMTS: Deep Multi-task Learning for Survival Prediction in Patients with Advanced Nasopharyngeal Carcinoma using Pretreatment PET/CT
Mingyuan Meng, Bingxin Gu, Lei Bi et al.
Nasopharyngeal Carcinoma (NPC) is a malignant epithelial cancer arising from the nasopharynx. Survival prediction is a major concern for NPC patients, as it provides early prognostic information to plan treatments. Recently, deep survival models based on deep learning have demonstrated the potential to outperform traditional radiomics-based survival prediction models. Deep survival models usually use image patches covering the whole target regions (e.g., nasopharynx for NPC) or containing only segmented tumor regions as the input. However, the models using the whole target regions will also include non-relevant background information, while the models using segmented tumor regions will disregard potentially prognostic information existing out of primary tumors (e.g., local lymph node metastasis and adjacent tissue invasion). In this study, we propose a 3D end-to-end Deep Multi-Task Survival model (DeepMTS) for joint survival prediction and tumor segmentation in advanced NPC from pretreatment PET/CT. Our novelty is the introduction of a hard-sharing segmentation backbone to guide the extraction of local features related to the primary tumors, which reduces the interference from non-relevant background information. In addition, we also introduce a cascaded survival network to capture the prognostic information existing out of primary tumors and further leverage the global tumor information (e.g., tumor size, shape, and locations) derived from the segmentation backbone. Our experiments with two clinical datasets demonstrate that our DeepMTS can consistently outperform traditional radiomics-based survival prediction models and existing deep survival models.
IV · Mar 9, 2021
Prediction of 5-year Progression-Free Survival in Advanced Nasopharyngeal Carcinoma with Pretreatment PET/CT using Multi-Modality Deep Learning-based Radiomics
Bingxin Gu, Mingyuan Meng, Lei Bi et al.
Objective: Deep Learning-based Radiomics (DLR) has achieved great success in medical image analysis and has been considered a replacement for conventional radiomics that relies on handcrafted features. In this study, we aimed to explore the capability of DLR for the prediction of 5-year Progression-Free Survival (PFS) in Nasopharyngeal Carcinoma (NPC) using pretreatment PET/CT. Methods: A total of 257 patients (170/87 in internal/external cohorts) with advanced NPC (TNM stage III or IVa) were enrolled. We developed an end-to-end multi-modality DLR model, in which a 3D convolutional neural network was optimized to extract deep features from pretreatment PET/CT images and predict the probability of 5-year PFS. TNM stage, as a high-level clinical feature, could be integrated into our DLR model to further improve the prognostic performance. To compare conventional radiomics and DLR, 1456 handcrafted features were extracted, and optimal conventional radiomics methods were selected from 54 cross-combinations of 6 feature selection methods and 9 classification methods. In addition, risk group stratification was performed with clinical signature, conventional radiomics signature, and DLR signature. Results: Our multi-modality DLR model using both PET and CT achieved higher prognostic performance than the optimal conventional radiomics method. Furthermore, the multi-modality DLR model outperformed single-modality DLR models using only PET or only CT. For risk group stratification, the conventional radiomics signature and DLR signature enabled significant differences between the high- and low-risk patient groups in both internal and external cohorts, while the clinical signature failed in the external cohort. Conclusion: Our study identified potential prognostic tools for survival prediction in advanced NPC, suggesting that DLR could provide complementary values to the current TNM staging.
CV · Mar 9, 2021
Enhancing Medical Image Registration via Appearance Adjustment Networks
Mingyuan Meng, Lei Bi, Michael Fulham et al.
Deformable image registration is fundamental for many medical image analyses. A key obstacle for accurate image registration lies in image appearance variations such as the variations in texture, intensities, and noise. These variations are readily apparent in medical images, especially in brain images where registration is frequently used. Recently, deep learning-based registration methods (DLRs), using deep neural networks, have shown computational efficiency that is several orders of magnitude faster than traditional optimization-based registration methods (ORs). DLRs rely on a globally optimized network that is trained with a set of training samples to achieve faster registration. DLRs tend, however, to disregard the target-pair-specific optimization inherent in ORs and thus have degraded adaptability to variations in testing samples. This limitation is severe for registering medical images with large appearance variations, especially since few existing DLRs explicitly take into account appearance variations. In this study, we propose an Appearance Adjustment Network (AAN) to enhance the adaptability of DLRs to appearance variations. Our AAN, when integrated into a DLR, provides appearance transformations to reduce the appearance variations during registration. In addition, we propose an anatomy-constrained loss function through which our AAN generates anatomy-preserving transformations. Our AAN has been purposely designed to be readily inserted into a wide range of DLRs and can be trained cooperatively in an unsupervised and end-to-end manner. We evaluated our AAN with three state-of-the-art DLRs on three well-established public datasets of 3D brain magnetic resonance imaging (MRI). The results show that our AAN consistently improved existing DLRs and outperformed state-of-the-art ORs on registration accuracy, while adding a fractional computational load to existing DLRs.
NE · Oct 19, 2020
SPA: Stochastic Probability Adjustment for System Balance of Unsupervised SNNs
Xingyu Yang, Mingyuan Meng, Shanlin Xiao et al.
Spiking neural networks (SNNs) receive widespread attention because of their low-power hardware characteristic and brain-like signal response mechanism, but currently, the performance of SNNs is still behind Artificial Neural Networks (ANNs). We build an information theory-inspired system called Stochastic Probability Adjustment (SPA) system to reduce this gap. The SPA maps the synapses and neurons of SNNs into a probability space where a neuron and all connected pre-synapses are represented by a cluster. The movement of synaptic transmitter between different clusters is modeled as a Brownian-like stochastic process in which the transmitter distribution is adaptive at different firing phases. We experimented with a wide range of existing unsupervised SNN architectures and achieved consistent performance improvements. The improvements in classification accuracy have reached 1.99% and 6.29% on the MNIST and EMNIST datasets respectively.
NE · Jan 29, 2020
Spiking Inception Module for Multi-layer Unsupervised Spiking Neural Networks
Mingyuan Meng, Xingyu Yang, Shanlin Xiao et al.
Spiking Neural Network (SNN), as a brain-inspired approach, is attracting attention due to its potential to produce ultra-high-energy-efficient hardware. Competitive learning based on Spike-Timing-Dependent Plasticity (STDP) is a popular method to train an unsupervised SNN. However, previous unsupervised SNNs trained through this method are limited to a shallow network with only one learnable layer and cannot achieve satisfactory results when compared with multi-layer SNNs. In this paper, we eased this limitation in three ways: 1) We proposed a Spiking Inception (Sp-Inception) module, inspired by the Inception module in the Artificial Neural Network (ANN) literature. This module is trained through STDP-based competitive learning and outperforms the baseline modules on learning capability, learning efficiency, and robustness. 2) We proposed a Pooling-Reshape-Activate (PRA) layer to make the Sp-Inception module stackable. 3) We stacked multiple Sp-Inception modules to construct multi-layer SNNs. Our algorithm outperforms the baseline algorithms on the hand-written digit classification task, and reaches state-of-the-art results on the MNIST dataset among the existing unsupervised SNNs.
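For readers unfamiliar with STDP, the sketch below shows a textbook pair-based STDP weight update: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise, with an exponential dependence on the time difference. The abstract does not specify the exact rule or constants used for the Sp-Inception module, so the function name and all parameter values here are illustrative assumptions.

```python
import math

def stdp_delta_w(dt_ms, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP weight change (textbook form, illustrative constants).

    dt_ms = t_post - t_pre in milliseconds.
    """
    if dt_ms > 0:
        # Pre fires before post: potentiation, decaying with the delay.
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        # Post fires before pre: depression, decaying with the delay.
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0


# Weight changes for a few spike-time differences (ms).
changes = {dt: stdp_delta_w(dt) for dt in (-50, -5, 5, 50)}
```

In competitive learning setups this local rule is typically combined with some form of inhibition between neurons so that different neurons specialise on different input patterns.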
NE · Dec 2, 2019
High-parallelism Inception-like Spiking Neural Networks for Unsupervised Feature Learning
Mingyuan Meng, Xingyu Yang, Lei Bi et al.
Spiking Neural Networks (SNNs) are brain-inspired, event-driven machine learning algorithms that have been widely recognized for producing ultra-high-energy-efficient hardware. Among existing SNNs, unsupervised SNNs based on synaptic plasticity, especially Spike-Timing-Dependent Plasticity (STDP), are considered to have great potential in imitating the learning process of the biological brain. Nevertheless, the existing STDP-based SNNs are limited by constrained learning capability and/or slow learning speed. Most STDP-based SNNs adopted a slow-learning Fully-Connected (FC) architecture and used a sub-optimal vote-based scheme for spike decoding. In this paper, we overcome these limitations with: 1) a design of high-parallelism network architecture, inspired by the Inception module in Artificial Neural Networks (ANNs); 2) use of a Vote-for-All (VFA) decoding layer as a replacement for the standard vote-based spike decoding scheme, to reduce the information loss in spike decoding; and 3) a proposed adaptive repolarization (resetting) mechanism that accelerates SNNs' learning by enhancing spiking activities. Our experimental results on two established benchmark datasets (MNIST/EMNIST) show that our network architecture resulted in superior performance compared to the widely used FC architecture and a more advanced Locally-Connected (LC) architecture, and that our SNN achieved competitive results with state-of-the-art unsupervised SNNs (95.64%/80.11% accuracy on the MNIST/EMNIST dataset) while having superior learning efficiency and robustness against hardware damage. Our SNN achieved great classification accuracy with only hundreds of training iterations, and random destruction of large numbers of synapses or neurons only led to negligible performance degradation.
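The "standard vote-based spike decoding scheme" that the abstract contrasts with VFA is commonly implemented roughly as follows: each output neuron is assigned the class for which it fired most on average during training, and a test sample is assigned the class whose neurons fire most on average. The sketch below illustrates that baseline only, under the assumption that this common formulation is the one meant; it is not the VFA layer, and the function names and toy spike counts are hypothetical.

```python
def assign_neuron_labels(train_counts, train_labels, n_classes):
    """Label each neuron with the class it responds to most on average."""
    n_neurons = len(train_counts[0])
    totals = [[0.0] * n_classes for _ in range(n_neurons)]
    samples_per_class = [0] * n_classes
    for counts, label in zip(train_counts, train_labels):
        samples_per_class[label] += 1
        for neuron, count in enumerate(counts):
            totals[neuron][label] += count
    neuron_labels = []
    for neuron in range(n_neurons):
        avg = [totals[neuron][c] / max(samples_per_class[c], 1)
               for c in range(n_classes)]
        neuron_labels.append(max(range(n_classes), key=avg.__getitem__))
    return neuron_labels

def vote_decode(counts, neuron_labels, n_classes):
    """Predict the class whose assigned neurons have the highest mean count."""
    sums = [0.0] * n_classes
    sizes = [0] * n_classes
    for count, label in zip(counts, neuron_labels):
        sums[label] += count
        sizes[label] += 1
    avg = [sums[c] / sizes[c] if sizes[c] else 0.0 for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Toy spike counts for 3 output neurons on 4 training samples (made up).
train_counts = [[5, 0, 1], [4, 1, 0], [0, 6, 5], [1, 5, 6]]
train_labels = [0, 0, 1, 1]
neuron_labels = assign_neuron_labels(train_counts, train_labels, n_classes=2)
prediction = vote_decode([6, 1, 0], neuron_labels, n_classes=2)
```

Because each neuron contributes only to the single class it was assigned, its responses to every other class are discarded at decoding time, which is one plausible reading of the information loss the abstract attributes to this scheme.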