CVJun 17, 2022Code
CDNet: Contrastive Disentangled Network for Fine-Grained Image Categorization of Ocular B-Scan UltrasoundRuilong Dan, Yunxiang Li, Yijie Wang et al.
Precise and rapid categorization of images in the B-scan ultrasound modality is vital for diagnosing ocular diseases. Nevertheless, distinguishing various diseases in ultrasound still challenges experienced ophthalmologists. Thus a novel contrastive disentangled network (CDNet) is developed in this work, aiming to tackle the fine-grained image categorization (FGIC) challenges of ocular abnormalities in ultrasound images, including intraocular tumor (IOT), retinal detachment (RD), posterior scleral staphyloma (PSS), and vitreous hemorrhage (VH). Three essential components of CDNet are the weakly-supervised lesion localization module (WSLL), contrastive multi-zoom (CMZ) strategy, and hyperspherical contrastive disentangled loss (HCD-Loss), respectively. These components facilitate feature disentanglement for fine-grained recognition in both the input and output aspects. The proposed CDNet is validated on our ZJU Ocular Ultrasound Dataset (ZJUOUSD), consisting of 5213 samples. Furthermore, the generalization ability of CDNet is validated on two public and widely-used chest X-ray FGIC benchmarks. Quantitative and qualitative results demonstrate the efficacy of our proposed CDNet, which achieves state-of-the-art performance in the FGIC task. Code is available at: https://github.com/ZeroOneGame/CDNet-for-OUS-FGIC .
CVDec 2, 2025Code
MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS RegistrationYaqi Wang, Zhi Li, Chengyu Wu et al.
Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
IVAug 2, 2022
CTooth+: A Large-scale Dental Cone Beam Computed Tomography Dataset and Benchmark for Tooth Volume SegmentationWeiwei Cui, Yaqi Wang, Yilong Li et al.
Accurate tooth volume segmentation is a prerequisite for computer-aided dental analysis. Deep learning-based tooth segmentation methods have achieved satisfying performances but require a large quantity of tooth data with ground truth. The dental data publicly available is limited meaning the existing methods can not be reproduced, evaluated and applied in clinical practice. In this paper, we establish a 3D dental CBCT dataset CTooth+, with 22 fully annotated volumes and 146 unlabeled volumes. We further evaluate several state-of-the-art tooth volume segmentation strategies based on fully-supervised learning, semi-supervised learning and active learning, and define the performance principles. This work provides a new benchmark for the tooth volume segmentation task, and the experiment can serve as the baseline for future AI-based dental imaging research and clinical application development.
IVMar 8, 2022
Plug-and-play Shape Refinement Framework for Multi-site and Lifespan Brain Skull StrippingYunxiang Li, Ruilong Dan, Shuai Wang et al.
Skull stripping is a crucial prerequisite step in the analysis of brain magnetic resonance images (MRI). Although many excellent works or tools have been proposed, they suffer from low generalization capability. For instance, the model trained on a dataset with specific imaging parameters cannot be well applied to other datasets with different imaging parameters. Especially, for the lifespan datasets, the model trained on an adult dataset is not applicable to an infant dataset due to the large domain difference. To address this issue, numerous methods have been proposed, where domain adaptation based on feature alignment is the most common. Unfortunately, this method has some inherent shortcomings, which need to be retrained for each new domain and requires concurrent access to the input images of both domains. In this paper, we design a plug-and-play shape refinement (PSR) framework for multi-site and lifespan skull stripping. To deal with the domain shift between multi-site lifespan datasets, we take advantage of the brain shape prior, which is invariant to imaging parameters and ages. Experiments demonstrate that our framework can outperform the state-of-the-art methods on multi-site lifespan datasets.
CVJun 17, 2022
CTooth: A Fully Annotated 3D Dataset and Benchmark for Tooth Volume Segmentation on Cone Beam Computed Tomography ImagesWeiwei Cui, Yaqi Wang, Qianni Zhang et al.
3D tooth segmentation is a prerequisite for computer-aided dental diagnosis and treatment. However, segmenting all tooth regions manually is subjective and time-consuming. Recently, deep learning-based segmentation methods produce convincing results and reduce manual annotation efforts, but it requires a large quantity of ground truth for training. To our knowledge, there are few tooth data available for the 3D segmentation study. In this paper, we establish a fully annotated cone beam computed tomography dataset CTooth with tooth gold standard. This dataset contains 22 volumes (7363 slices) with fine tooth labels annotated by experienced radiographic interpreters. To ensure a relative even data sampling distribution, data variance is included in the CTooth including missing teeth and dental restoration. Several state-of-the-art segmentation methods are evaluated on this dataset. Afterwards, we further summarise and apply a series of 3D attention-based Unet variants for segmenting tooth volumes. This work provides a new benchmark for the tooth volume segmentation task. Experimental evidence proves that attention modules of the 3D UNet structure boost responses in tooth areas and inhibit the influence of background and noise. The best performance is achieved by 3D Unet with SKNet attention module, of 88.04 \% Dice and 78.71 \% IOU, respectively. The attention-based Unet framework outperforms other state-of-the-art methods on the CTooth dataset. The codebase and dataset are released.
CVSep 18, 2024Code
SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with MambaXiangning Zhang, Qingwei Zhang, Jinnan Chen et al.
Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially developed for early gastric cancer treatment and has expanded to address diverse gastrointestinal lesions. While computer-assisted surgery (CAS) systems enhance ESD precision and safety, their efficacy hinges on accurate real-time surgical phase recognition, a task complicated by ESD's inherent complexity, including heterogeneous lesion characteristics and dynamic tissue interactions. Existing video-based phase recognition algorithms, constrained by inefficient temporal context modeling, exhibit limited performance in capturing fine-grained phase transitions and long-range dependencies. To overcome these limitations, we propose SPRMamba, a novel framework integrating a Mamba-based architecture with a Scaled Residual TranMamba (SRTM) block to synergize long-term temporal modeling and localized detail extraction. SPRMamba further introduces the Hierarchical Sampling Strategy to optimize computational efficiency, enabling real-time processing critical for clinical deployment. Evaluated on the ESD385 dataset and the cholecystectomy benchmark Cholec80, SPRMamba achieves state-of-the-art performance (87.64% accuracy on ESD385, +1.0% over prior methods), demonstrating robust generalizability across surgical workflows. This advancement bridges the gap between computational efficiency and temporal sensitivity, offering a transformative tool for intraoperative guidance and skill assessment in ESD surgery. The code is accessible at https://github.com/Zxnyyyyy/SPRMamba.
CVJun 17, 2022
DU-Net based Unsupervised Contrastive Learning for Cancer Segmentation in Histology ImagesYilong Li, Yaqi Wang, Huiyu Zhou et al.
In this paper, we introduce an unsupervised cancer segmentation framework for histology images. The framework involves an effective contrastive learning scheme for extracting distinctive visual representations for segmentation. The encoder is a Deep U-Net (DU-Net) structure that contains an extra fully convolution layer compared to the normal U-Net. A contrastive learning scheme is developed to solve the problem of lacking training sets with high-quality annotations on tumour boundaries. A specific set of data augmentation techniques are employed to improve the discriminability of the learned colour features from contrastive learning. Smoothing and noise elimination are conducted using convolutional Conditional Random Fields. The experiments demonstrate competitive performance in segmentation even better than some popular supervised networks.
CVJun 5, 2023
Enhancing Point Annotations with Superpixel and Confidence Learning Guided for Improving Semi-Supervised OCT Fluid SegmentationTengjin Weng, Yang Shen, Kai Jin et al.
Automatic segmentation of fluid in Optical Coherence Tomography (OCT) images is beneficial for ophthalmologists to make an accurate diagnosis. Although semi-supervised OCT fluid segmentation networks enhance their performance by introducing additional unlabeled data, the performance enhancement is limited. To address this, we propose Superpixel and Confident Learning Guide Point Annotations Network (SCLGPA-Net) based on the teacher-student architecture, which can learn OCT fluid segmentation from limited fully-annotated data and abundant point-annotated data. Specifically, we use points to annotate fluid regions in unlabeled OCT images and the Superpixel-Guided Pseudo-Label Generation (SGPLG) module generates pseudo-labels and pixel-level label trust maps from the point annotations. The label trust maps provide an indication of the reliability of the pseudo-labels. Furthermore, we propose the Confident Learning Guided Label Refinement (CLGLR) module identifies error information in the pseudo-labels and leads to further refinement. Experiments on the RETOUCH dataset show that we are able to reduce the need for fully-annotated data by 94.22\%, closing the gap with the best fully supervised baselines to a mean IoU of only 2\%. Furthermore, We constructed a private 2D OCT fluid segmentation dataset for evaluation. Compared with other methods, comprehensive experimental results demonstrate that the proposed method can achieve excellent performance in OCT fluid segmentation.
NAMar 8, 2019
A highly parallel multilevel Newton-Krylov-Schwarz method with subspace-based coarsening and partition-based balancing for the multigroup neutron transport equations on 3D unstructured meshesFande Kong, Yaqi Wang, Derek R. Gaston et al.
The multigroup neutron transport equations have been widely used to study the motion of neutrons and their interactions with the background materials. Numerical simulation of the multigroup neutron transport equations is computationally challenging because the equations is defined on a high dimensional phase space (1D in energy, 2D in angle, and 3D in spatial space), and furthermore, for realistic applications, the computational spatial domain is complex and the materials are heterogeneous. The multilevel domain decomposition methods is one of the most popular algorithms for solving the multigroup neutron transport equations, but the construction of coarse spaces is expensive and often not strongly scalable when the number of processor cores is large. In this paper, we study a highly parallel multilevel Newton-Krylov-Schwarz method equipped with several novel components, such as subspace-based coarsening, partition-based balancing and hierarchical mesh partitioning, that enable the overall simulation strongly scalable in terms of the compute time. Compared with the traditional coarsening method, the subspace-based coarsening algorithm significantly reduces the cost of the preconditioner setup that is often unscalable. In addition, the partition-based balancing strategy enhances the parallel efficiency of the overall solver by assigning a nearly-equal amount of work to each processor core. The hierarchical mesh partitioning is able to generate a large number of subdomains and meanwhile minimizes the off-node communication. We numerically show that the proposed algorithm is scalable with more than 10,000 processor cores for a realistic application with a few billions unknowns on 3D unstructured meshes.
CVJul 18, 2024
STS MICCAI 2023 Challenge: Grand challenge on 2D and 3D semi-supervised tooth segmentationYaqi Wang, Yifan Zhang, Xiaodiao Chen et al.
Computer-aided design (CAD) tools are increasingly popular in modern dental practice, particularly for treatment planning or comprehensive prognosis evaluation. In particular, the 2D panoramic X-ray image efficiently detects invisible caries, impacted teeth and supernumerary teeth in children, while the 3D dental cone beam computed tomography (CBCT) is widely used in orthodontics and endodontics due to its low radiation dose. However, there is no open-access 2D public dataset for children's teeth and no open 3D dental CBCT dataset, which limits the development of automatic algorithms for segmenting teeth and analyzing diseases. The Semi-supervised Teeth Segmentation (STS) Challenge, a pioneering event in tooth segmentation, was held as a part of the MICCAI 2023 ToothFairy Workshop on the Alibaba Tianchi platform. This challenge aims to investigate effective semi-supervised tooth segmentation algorithms to advance the field of dentistry. In this challenge, we provide two modalities including the 2D panoramic X-ray images and the 3D CBCT tooth volumes. In Task 1, the goal was to segment tooth regions in panoramic X-ray images of both adult and pediatric teeth. Task 2 involved segmenting tooth sections using CBCT volumes. Limited labelled images with mostly unlabelled ones were provided in this challenge prompt using semi-supervised algorithms for training. In the preliminary round, the challenge received registration and result submission by 434 teams, with 64 advancing to the final round. This paper summarizes the diverse methods employed by the top-ranking teams in the STS MICCAI 2023 Challenge.
CVDec 30, 2025Code
Physically-Grounded Manifold Projection Model for Generalizable Metal Artifact Reduction in Dental CBCTZhi Li, Yaqi Wang, Bingtao Ma et al.
Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: https://github.com/ricoleehduu/PGMP.
24.3IRMay 22
TubiFM: Unified Item, Carousel, and Search Ranking for Streaming DiscoveryAlexandre Salle, Chenglei Niu, Suchismit Mahapatra et al.
Personalized discovery systems often train separate models for item ranking, carousel ranking, and search, even though these tasks expose complementary signals from the same viewer journey: watches shape carousel and item ranking, search queries reveal intent even when they do not lead to a catalog match, and watch history helps interpret search as rewatching, continuation, or new discovery. We introduce the user story, a serialized representation that turns a user's cross-surface history - attributes, sessions, watch events with surface and carousel context, and search events - into a single token sequence. By interleaving pretrained language tokens with domain-specific event tokens, user stories let heterogeneous recommendation and search tasks be expressed as prompted next-token prediction over a shared grammar. TubiFM is one instantiation of this approach: a Llama 3.2 1B-based model trained on user stories and prompted to rank items, carousels, or search results without task-specific architectures. In offline evaluation, this single model outperforms specialist baselines across item, carousel, and search ranking. In online A/B tests, TubiFM significantly improves search total viewing time (TVT) by $+3.9\%$ and carousel TVT by $+0.30\%$. Item ranking is statistically neutral on TVT ($+0.14\%$), but matches a mature production stack; across all three tasks, TubiFM serves on L40S GPUs and reduces p99 ranking latency from 500ms to 200ms. These results show that shared user stories can improve discovery while simplifying ranking systems.
IVNov 13, 2023
TTMFN: Two-stream Transformer-based Multimodal Fusion Network for Survival PredictionRuiquan Ge, Xiangyang Hu, Rungen Huang et al.
Survival prediction plays a crucial role in assisting clinicians with the development of cancer treatment protocols. Recent evidence shows that multimodal data can help in the diagnosis of cancer disease and improve survival prediction. Currently, deep learning-based approaches have experienced increasing success in survival prediction by integrating pathological images and gene expression data. However, most existing approaches overlook the intra-modality latent information and the complex inter-modality correlations. Furthermore, existing modalities do not fully exploit the immense representational capabilities of neural networks for feature aggregation and disregard the importance of relationships between features. Therefore, it is highly recommended to address these issues in order to enhance the prediction performance by proposing a novel deep learning-based method. We propose a novel framework named Two-stream Transformer-based Multimodal Fusion Network for survival prediction (TTMFN), which integrates pathological images and gene expression data. In TTMFN, we present a two-stream multimodal co-attention transformer module to take full advantage of the complex relationships between different modalities and the potential connections within the modalities. Additionally, we develop a multi-head attention pooling approach to effectively aggregate the feature representations of the two modalities. The experiment results on four datasets from The Cancer Genome Atlas demonstrate that TTMFN can achieve the best performance or competitive results compared to the state-of-the-art methods in predicting the overall survival of patients.
CVAug 22, 2024
Class-balanced Open-set Semi-supervised Object Detection for Medical ImagesZhanyun Lu, Renshu Gu, Huimin Cheng et al.
Medical image datasets in the real world are often unlabeled and imbalanced, and Semi-Supervised Object Detection (SSOD) can utilize unlabeled data to improve an object detector. However, existing approaches predominantly assumed that the unlabeled data and test data do not contain out-of-distribution (OOD) classes. The few open-set semi-supervised object detection methods have two weaknesses: first, the class imbalance is not considered; second, the OOD instances are distinguished and simply discarded during pseudo-labeling. In this paper, we consider the open-set semi-supervised object detection problem which leverages unlabeled data that contain OOD classes to improve object detection for medical images. Our study incorporates two key innovations: Category Control Embed (CCE) and out-of-distribution Detection Fusion Classifier (OODFC). CCE is designed to tackle dataset imbalance by constructing a Foreground information Library, while OODFC tackles open-set challenges by integrating the ``unknown'' information into basic pseudo-labels. Our method outperforms the state-of-the-art SSOD performance, achieving a 4.25 mAP improvement on the public Parasite dataset.
CVNov 17, 2023
MSE-Nets: Multi-annotated Semi-supervised Ensemble Networks for Improving Segmentation of Medical Image with Ambiguous BoundariesShuai Wang, Tengjin Weng, Jingyi Wang et al.
Medical image segmentation annotations exhibit variations among experts due to the ambiguous boundaries of segmented objects and backgrounds in medical images. Although using multiple annotations for each image in the fully-supervised has been extensively studied for training deep models, obtaining a large amount of multi-annotated data is challenging due to the substantial time and manpower costs required for segmentation annotations, resulting in most images lacking any annotations. To address this, we propose Multi-annotated Semi-supervised Ensemble Networks (MSE-Nets) for learning segmentation from limited multi-annotated and abundant unannotated data. Specifically, we introduce the Network Pairwise Consistency Enhancement (NPCE) module and Multi-Network Pseudo Supervised (MNPS) module to enhance MSE-Nets for the segmentation task by considering two major factors: (1) to optimize the utilization of all accessible multi-annotated data, the NPCE separates (dis)agreement annotations of multi-annotated data at the pixel level and handles agreement and disagreement annotations in different ways, (2) to mitigate the introduction of imprecise pseudo-labels, the MNPS extends the training data by leveraging consistent pseudo-labels from unannotated data. Finally, we improve confidence calibration by averaging the predictions of base networks. Experiments on the ISIC dataset show that we reduced the demand for multi-annotated data by 97.75\% and narrowed the gap with the best fully-supervised baseline to just a Jaccard index of 4\%. Furthermore, compared to other semi-supervised methods that rely only on a single annotation or a combined fusion approach, the comprehensive experimental results on ISIC and RIGA datasets demonstrate the superior performance of our proposed method in medical image segmentation with ambiguous boundaries.
IVDec 20, 2024Code
Efficient MedSAMs: Segment Anything in Medical Images on LaptopJun Ma, Feifei Li, Sumin Kim et al.
Promptable segmentation foundation models have emerged as a transformative approach to addressing the diverse needs in medical images, but most existing models require expensive computing, posing a big barrier to their adoption in clinical practice. In this work, we organized the first international competition dedicated to promptable medical image segmentation, featuring a large-scale dataset spanning nine common imaging modalities from over 20 different institutions. The top teams developed lightweight segmentation foundation models and implemented an efficient inference pipeline that substantially reduced computational requirements while maintaining state-of-the-art segmentation accuracy. Moreover, the post-challenge phase advanced the algorithms through the design of performance booster and reproducibility tasks, resulting in improved algorithms and validated reproducibility of the winning solution. Furthermore, the best-performing algorithms have been incorporated into the open-source software with a user-friendly interface to facilitate clinical adoption. The data and code are publicly available to foster the further development of medical image segmentation foundation models and pave the way for impactful real-world applications.
CLJul 22, 2024
SETTP: Style Extraction and Tunable Inference via Dual-level Transferable Prompt LearningChunzhen Jin, Yongfeng Huang, Yaqi Wang et al.
Text style transfer, an important research direction in natural language processing, aims to adapt the text to various preferences but often faces challenges with limited resources. In this work, we introduce a novel method termed Style Extraction and Tunable Inference via Dual-level Transferable Prompt Learning (SETTP) for effective style transfer in low-resource scenarios. First, SETTP learns source style-level prompts containing fundamental style characteristics from high-resource style transfer. During training, the source style-level prompts are transferred through an attention module to derive a target style-level prompt for beneficial knowledge provision in low-resource style transfer. Additionally, we propose instance-level prompts obtained by clustering the target resources based on the semantic content to reduce semantic bias. We also propose an automated evaluation approach of style similarity based on alignment with human evaluations using ChatGPT-4. Our experiments across three resourceful styles show that SETTP requires only 1/20th of the data volume to achieve performance comparable to state-of-the-art methods. In tasks involving scarce data like writing style and role style, SETTP outperforms previous methods by 16.24\%.
IVMay 15, 2024Code
MMFusion: Multi-modality Diffusion Model for Lymph Node Metastasis Diagnosis in Esophageal CancerChengyu Wu, Chengkai Wang, Yaqi Wang et al.
Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal based methods may likely introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations need to be further explored, lacking insightful exploration of prognostic correlation in multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at https://github.com/wuchengyu123/MMFusion.
IVAug 23, 2024
Multi-modal Intermediate Feature Interaction AutoEncoder for Overall Survival Prediction of Esophageal Squamous Cell CancerChengyu Wu, Yatao Zhang, Yaqi Wang et al.
Survival prediction for esophageal squamous cell cancer (ESCC) is crucial for doctors to assess a patient's condition and tailor treatment plans. The application and development of multi-modal deep learning in this field have attracted attention in recent years. However, the prognostically relevant features between cross-modalities have not been further explored in previous studies, which could hinder the performance of the model. Furthermore, the inherent semantic gap between different modal feature representations is also ignored. In this work, we propose a novel autoencoder-based deep learning model to predict the overall survival of the ESCC. Two novel modules were designed for multi-modal prognosis-related feature reinforcement and modeling ability enhancement. In addition, a novel joint loss was proposed to make the multi-modal feature representations more aligned. Comparison and ablation experiments demonstrated that our model can achieve satisfactory results in terms of discriminative ability, risk stratification, and the effectiveness of the proposed modules.
55.3CVMar 17
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document RetrievalWeiqing Li, Jinyue Guo, Yaqi Wang et al.
Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
IVNov 28, 2025Code
MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT ImagesYaqi Wang, Zhi Li, Chengyu Wu et al.
Orthopantomogram (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants' submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.
CVNov 20, 2023
PMP-Swin: Multi-Scale Patch Message Passing Swin Transformer for Retinal Disease ClassificationZhihan Yang, Zhiming Cheng, Tengjin Weng et al.
Retinal disease is one of the primary causes of visual impairment, and early diagnosis is essential for preventing further deterioration. Nowadays, many works have explored Transformers for diagnosing diseases due to their strong visual representation capabilities. However, retinal diseases exhibit milder forms and often present with overlapping signs, which pose great difficulties for accurate multi-class classification. Therefore, we propose a new framework named Multi-Scale Patch Message Passing Swin Transformer for multi-class retinal disease classification. Specifically, we design a Patch Message Passing (PMP) module based on the Message Passing mechanism to establish global interaction for pathological semantic features and to exploit the subtle differences further between different diseases. Moreover, considering the various scale of pathological features we integrate multiple PMP modules for different patch sizes. For evaluation, we have constructed a new dataset, named OPTOS dataset, consisting of 1,033 high-resolution fundus images photographed by Optos camera and conducted comprehensive experiments to validate the efficacy of our proposed method. And the results on both the public dataset and our dataset demonstrate that our method achieves remarkable performance compared to state-of-the-art methods.
IVMar 2, 2025Code
Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue BoundariesYang Ding, Can Han, Sijia Du et al.
Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose \textbf{RRESM}, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.
CLJan 21, 2024Code
Large Language Model based Multi-Agents: A Survey of Progress and ChallengesTaicheng Guo, Xiuying Chen, Yaqi Wang et al.
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems.
CVSep 30, 2021Code
GT U-Net: A U-Net Like Group Transformer Network for Tooth Root SegmentationYunxiang Li, Shuai Wang, Jun Wang et al.
To achieve an accurate assessment of root canal therapy, a fundamental step is to perform tooth root segmentation on oral X-ray images, in that the position of tooth root boundary is significant anatomy information in root canal therapy evaluation. However, the fuzzy boundary makes the tooth root segmentation very challenging. In this paper, we propose a novel end-to-end U-Net like Group Transformer Network (GT U-Net) for the tooth root segmentation. The proposed network retains the essential structure of U-Net but each of the encoders and decoders is replaced by a group Transformer, which significantly reduces the computational cost of traditional Transformer architectures by using the grouping structure and the bottleneck structure. In addition, the proposed GT U-Net is composed of a hybrid structure of convolution and Transformer, which makes it independent of pre-training weights. For optimization, we also propose a shape-sensitive Fourier Descriptor (FD) loss function to make use of shape prior knowledge. Experimental results show that our proposed network achieves the state-of-the-art performance on our collected tooth root segmentation dataset and the public retina dataset DRIVE. Code has been released at https://github.com/Kent0n-Li/GT-U-Net.
IVNov 11, 2020Code
Multiscale Attention Guided Network for COVID-19 Diagnosis Using Chest X-ray ImagesJingxiong Li, Yaqi Wang, Shuai Wang et al.
Coronavirus disease 2019 (COVID-19) is one of the most destructive pandemic after millennium, forcing the world to tackle a health crisis. Automated lung infections classification using chest X-ray (CXR) images could strengthen diagnostic capability when handling COVID-19. However, classifying COVID-19 from pneumonia cases using CXR image is a difficult task because of shared spatial characteristics, high feature variation and contrast diversity between cases. Moreover, massive data collection is impractical for a newly emerged disease, which limited the performance of data thirsty deep learning models. To address these challenges, Multiscale Attention Guided deep network with Soft Distance regularization (MAG-SD) is proposed to automatically classify COVID-19 from pneumonia CXR images. In MAG-SD, MA-Net is used to produce prediction vector and attention from multiscale feature maps. To improve the robustness of trained model and relieve the shortage of training data, attention guided augmentations along with a soft distance regularization are posed, which aims at generating meaningful augmentations and reduce noise. Our multiscale attention model achieves better classification performance on our pneumonia CXR image dataset. Plentiful experiments are proposed for MAG-SD which demonstrates its unique advantage in pneumonia classification over cutting-edge models. The code is available at https://github.com/JasonLeeGHub/MAG-SD.
MAJan 12
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit SizingGuanyuan Pan, Yugui Lin, Tiansheng Zhou et al.
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches often underutilize circuit schematics and lack the explainability required for industry adoption. To tackle these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-starting from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance, achieving a 100% success rate in optimizing an amplifier with a complementary input and a class-AB output stage, while maintaining total runtime under 43 minutes across all experiments.
CVNov 11, 2025
SASG-DA: Sparse-Aware Semantic-Guided Diffusion Augmentation For Myoelectric Gesture RecognitionChen Liu, Can Han, Weishi Xu et al.
Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors to effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Sampling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling (SASS) strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.
CVNov 29, 2024
Adaptive Interactive Segmentation for Multimodal Medical Imaging via Selection EngineZhi Li, Kai Zhao, Yaqi Wang et al.
In medical image analysis, achieving fast, efficient, and accurate segmentation is essential for automated diagnosis and treatment. Although recent advancements in deep learning have significantly improved segmentation accuracy, current models often face challenges in adaptability and generalization, particularly when processing multi-modal medical imaging data. These limitations stem from the substantial variations between imaging modalities and the inherent complexity of medical data. To address these challenges, we propose the Strategy-driven Interactive Segmentation Model (SISeg), built on SAM2, which enhances segmentation performance across various medical imaging modalities by integrating a selection engine. To mitigate memory bottlenecks and optimize prompt frame selection during the inference of 2D image sequences, we developed an automated system, the Adaptive Frame Selection Engine (AFSE). This system dynamically selects the optimal prompt frames without requiring extensive prior medical knowledge and enhances the interpretability of the model's inference process through an interactive feedback mechanism. We conducted extensive experiments on 10 datasets covering 7 representative medical imaging modalities, demonstrating the SISeg model's robust adaptability and generalization in multi-modal tasks. The project page and code will be available at: [URL].
LGNov 20, 2024
Retrieval-Augmented Generation for Domain-Specific Question Answering: A Case Study on Pittsburgh and CMUHaojia Sun, Yaqi Wang, Shuting Zhang
We designed a Retrieval-Augmented Generation (RAG) system to provide large language models with relevant documents for answering domain-specific questions about Pittsburgh and Carnegie Mellon University (CMU). We extracted over 1,800 subpages using a greedy scraping strategy and employed a hybrid annotation process, combining manual and Mistral-generated question-answer pairs, achieving an inter-annotator agreement (IAA) score of 0.7625. Our RAG framework integrates BM25 and FAISS retrievers, enhanced with a reranker for improved document retrieval accuracy. Experimental results show that the RAG system significantly outperforms a non-RAG baseline, particularly in time-sensitive and complex queries, with an F1 score improvement from 5.45% to 42.21% and recall of 56.18%. This study demonstrates the potential of RAG systems in enhancing answer precision and relevance, while identifying areas for further optimization in document retrieval and model training.
CVJul 4, 2025
CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal DissectionXiangning Zhang, Jinnan Chen, Qingwei Zhang et al.
Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.
ARApr 14, 2025
Graph Neural Networks Based Analog Circuit Link PredictionGuanyuan Pan, Tiansheng Zhou, Jianxiang Zhao et al.
Circuit link prediction, which identifies missing component connections from incomplete netlists, is crucial in analog circuit design automation. However, existing methods face three main challenges: 1) Insufficient use of topological patterns in circuit graphs reduces prediction accuracy; 2) Data scarcity due to the complexity of annotations hinders model generalization; 3) Limited adaptability to various netlist formats restricts model flexibility. We propose Graph Neural Networks Based Analog Circuit Link Prediction (GNN-ACLP), a graph neural networks (GNNs) based method featuring three innovations to tackle these challenges. First, we introduce the SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction) framework and achieve port-level accuracy in circuit link prediction. Second, we propose Netlist Babel Fish, a netlist format conversion tool that leverages retrieval-augmented generation (RAG) with a large language model (LLM) to enhance the compatibility of netlist formats. Finally, we build a comprehensive dataset, SpiceNetlist, comprising 775 annotated circuits of 7 different types across 10 component classes. Experiments demonstrate accuracy improvements of 16.08% on SpiceNetlist, 11.38% on Image2Net, and 16.01% on Masala-CHAI compared to the baseline in intra-dataset evaluation, while maintaining accuracy from 92.05% to 99.07% in cross-dataset evaluation, demonstrating robust feature transfer capabilities. However, its linear computational complexity makes processing large-scale netlists challenging and requires future addressing.
AINov 21, 2024
SRSA: A Cost-Efficient Strategy-Router Search Agent for Real-world Human-Machine InteractionsYaqi Wang, Haipei Xu
Recently, as Large Language Models (LLMs) have shown impressive emerging capabilities and gained widespread popularity, research on LLM-based search agents has proliferated. In real-world situations, users often input contextual and highly personalized queries to chatbots, challenging LLMs to capture context and generate appropriate answers. However, much of the prior research has not focused specifically on authentic human-machine dialogue scenarios. It also ignores the important balance between response quality and computational cost by forcing all queries to follow the same agent process. To address these gaps, we propose a Strategy-Router Search Agent (SRSA), routing different queries to appropriate search strategies and enabling fine-grained serial searches to obtain high-quality results at a relatively low cost. To evaluate our work, we introduce a new dataset, Contextual Query Enhancement Dataset (CQED), comprising contextual queries to simulate authentic and daily interactions between humans and chatbots. Using LLM-based automatic evaluation metrics, we assessed SRSA's performance in terms of informativeness, completeness, novelty, and actionability. To conclude, SRSA provides an approach that resolves the issue of simple serial searches leading to degenerate answers for lengthy and contextual queries, effectively and efficiently parses complex user queries, and generates more comprehensive and informative responses without fine-tuning an LLM.
CLFeb 5, 2022
Semantic Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear FusionPeiying Zhang, Xingzhe Huang, Yaqi Wang et al.
Natural language processing (NLP) task has achieved excellent performance in many fields, including semantic understanding, automatic summarization, image recognition and so on. However, most of the neural network models for NLP extract the text in a fine-grained way, which is not conducive to grasp the meaning of the text from a global perspective. To alleviate the problem, the combination of the traditional statistical method and deep learning model as well as a novel model based on multi model nonlinear fusion are proposed in this paper. The model uses the Jaccard coefficient based on part of speech, Term Frequency-Inverse Document Frequency (TF-IDF) and word2vec-CNN algorithm to measure the similarity of sentences respectively. According to the calculation accuracy of each model, the normalized weight coefficient is obtained and the calculation results are compared. The weighted vector is input into the fully connected neural network to give the final classification results. As a result, the statistical sentence similarity evaluation algorithm reduces the granularity of feature extraction, so it can grasp the sentence features globally. Experimental results show that the matching of sentence similarity calculation method based on multi model nonlinear fusion is 84%, and the F1 value of the model is 75%.
CVOct 28, 2021
Dispensed Transformer Network for Unsupervised Domain AdaptationYunxiang Li, Jingxiong Li, Ruilong Dan et al.
Accurate segmentation is a crucial step in medical image analysis and applying supervised machine learning to segment the organs or lesions has been substantiated effective. However, it is costly to perform data annotation that provides ground truth labels for training the supervised algorithms, and the high variance of data that comes from different domains tends to severely degrade system performance over cross-site or cross-modality datasets. To mitigate this problem, a novel unsupervised domain adaptation (UDA) method named dispensed Transformer network (DTNet) is introduced in this paper. Our novel DTNet contains three modules. First, a dispensed residual transformer block is designed, which realizes global attention by dispensed interleaving operation and deals with the excessive computational cost and GPU memory usage of the Transformer. Second, a multi-scale consistency regularization is proposed to alleviate the loss of details in the low-resolution output for better feature alignment. Finally, a feature ranking discriminator is introduced to automatically assign different weights to domain-gap features to lessen the feature distribution distance, reducing the performance shift of two domains. The proposed method is evaluated on large fluorescein angiography (FA) retinal nonperfusion (RNP) cross-site dataset with 676 images and a wide used cross-modality dataset from the MM-WHS challenge. Extensive results demonstrate that our proposed network achieves the best performance in comparison with several state-of-the-art techniques.
CVOct 10, 2021
3D Object Detection Combining Semantic and Geometric Features from Point CloudsHao Peng, Guofeng Tong, Zheng Li et al.
In this paper, we investigate the combination of voxel-based methods and point-based methods, and propose a novel end-to-end two-stage 3D object detector named SGNet for point clouds scenes. The voxel-based methods voxelize the scene to regular grids, which can be processed with the current advanced feature learning frameworks based on convolutional layers for semantic feature learning. Whereas the point-based methods can better extract the geometric feature of the point due to the coordinate reservations. The combination of the two is an effective solution for 3D object detection from point clouds. However, most current methods use a voxel-based detection head with anchors for final classification and localization. Although the preset anchors cover the entire scene, it is not suitable for point clouds detection tasks with larger scenes and multiple categories due to the limitation of voxel size. In this paper, we propose a voxel-to-point module (VTPM) that captures semantic and geometric features. The VTPM is a Voxel-Point-Based Module that finally implements 3D object detection in point space, which is more conducive to the detection of small-size objects and avoids the presets of anchors in inference stage. In addition, a Confidence Adjustment Module (CAM) with the center-boundary-aware confidence attention is proposed to solve the misalignment between the predicted confidence and proposals in the regions of the interest (RoI) selection. The SGNet proposed in this paper has achieved state-of-the-art results for 3D object detection in the KITTI dataset, especially in the detection of small-size objects such as cyclists. Actually, as of September 19, 2021, for KITTI dataset, SGNet ranked 1st in 3D and BEV detection on cyclists with easy difficulty level, and 2nd in the 3D detection of moderate cyclists.
IVSep 26, 2021
Structure-aware scale-adaptive networks for cancer segmentation in whole-slide imagesYibao Sun, Giussepi Lopez, Yaqi Wang et al.
Cancer segmentation in whole-slide images is a fundamental step for viable tumour burden estimation, which is of great value for cancer assessment. However, factors like vague boundaries or small regions dissociated from viable tumour areas make it a challenging task. Considering the usefulness of multi-scale features in various vision-related tasks, we present a structure-aware scale-adaptive feature selection method for efficient and accurate cancer segmentation. Based on a segmentation network with a popular encoder-decoder architecture, a scale-adaptive module is proposed for selecting more robust features to represent the vague, non-rigid boundaries. Furthermore, a structural similarity metric is proposed for better tissue structure awareness to deal with small region segmentation. In addition, advanced designs including several attention mechanisms and the selective-kernel convolutions are applied to the baseline network for comparative study purposes. Extensive experimental results show that the proposed structure-aware scale-adaptive networks achieve outstanding performance on liver cancer segmentation when compared to top ten submitted results in the challenge of PAIP 2019. Further evaluation on colorectal cancer segmentation shows that the scale-adaptive module improves the baseline network or outperforms the other excellent designs of attention mechanisms when considering the tradeoff between efficiency and accuracy.
CVJul 2, 2021
Magnification-independent Histopathological Image Classification with Similarity-based Multi-scale EmbeddingsYibao Sun, Xingru Huang, Yaqi Wang et al.
The classification of histopathological images is of great value in both cancer diagnosis and pathological studies. However, multiple reasons, such as variations caused by magnification factors and class imbalance, make it a challenging task where conventional methods that learn from image-label datasets perform unsatisfactorily in many cases. We observe that tumours of the same class often share common morphological patterns. To exploit this fact, we propose an approach that learns similarity-based multi-scale embeddings (SMSE) for magnification-independent histopathological image classification. In particular, a pair loss and a triplet loss are leveraged to learn similarity-based embeddings from image pairs or image triplets. The learned embeddings provide accurate measurements of similarities between images, which are regarded as a more effective form of representation for histopathological morphology than normal image features. Furthermore, in order to ensure the generated models are magnification-independent, images acquired at different magnification factors are simultaneously fed to networks during training for learning multi-scale embeddings. In addition to the SMSE, to eliminate the impact of class imbalance, instead of using the hard sample mining strategy that intuitively discards some easy samples, we introduce a new reinforced focal loss to simultaneously punish hard misclassified samples while suppressing easy well-classified samples. Experimental results show that the SMSE improves the performance for histopathological image classification tasks for both breast and liver cancers by a large margin compared to previous methods. In particular, the SMSE achieves the best performance on the BreakHis benchmark with an improvement ranging from 5% to 18% compared to previous methods using traditional features.
CVMay 2, 2021
AGMB-Transformer: Anatomy-Guided Multi-Branch Transformer Network for Automated Evaluation of Root Canal TherapyYunxiang Li, Guodong Zeng, Yifan Zhang et al.
Accurate evaluation of the treatment result on X-ray images is a significant and challenging step in root canal therapy since the incorrect interpretation of the therapy results will hamper timely follow-up which is crucial to the patients' treatment outcome. Nowadays, the evaluation is performed in a manual manner, which is time-consuming, subjective, and error-prone. In this paper, we aim to automate this process by leveraging the advances in computer vision and artificial intelligence, to provide an objective and accurate method for root canal therapy result assessment. A novel anatomy-guided multi-branch Transformer (AGMB-Transformer) network is proposed, which first extracts a set of anatomy features and then uses them to guide a multi-branch Transformer network for evaluation. Specifically, we design a polynomial curve fitting segmentation strategy with the help of landmark detection to extract the anatomy features. Moreover, a branch fusion module and a multi-branch structure including our progressive Transformer and Group Multi-Head Self-Attention (GMHSA) are designed to focus on both global and local features for an accurate diagnosis. To facilitate the research, we have collected a large-scale root canal therapy evaluation dataset with 245 root canal therapy X-ray images, and the experiment results show that our AGMB-Transformer can improve the diagnosis accuracy from 57.96% to 90.20% compared with the baseline network. The proposed AGMB-Transformer can achieve a highly accurate evaluation of root canal therapy. To our best knowledge, our work is the first to perform automatic root canal therapy evaluation and has important clinical value to reduce the workload of endodontists.
CVMar 7, 2021
High-Resolution Segmentation of Tooth Root Fuzzy Edge Based on Polynomial Curve Fitting with Landmark DetectionYunxiang Li, Yifan Zhang, Yaqi Wang et al.
As the most economical and routine auxiliary examination in the diagnosis of root canal treatment, oral X-ray has been widely used by stomatologists. It is still challenging to segment the tooth root with a blurry boundary for the traditional image segmentation method. To this end, we propose a model for high-resolution segmentation based on polynomial curve fitting with landmark detection (HS-PCL). It is based on detecting multiple landmarks evenly distributed on the edge of the tooth root to fit a smooth polynomial curve as the segmentation of the tooth root, thereby solving the problem of fuzzy edge. In our model, a maximum number of the shortest distances algorithm (MNSDA) is proposed to automatically reduce the negative influence of the wrong landmarks which are detected incorrectly and deviate from the tooth root on the fitting result. Our numerical experiments demonstrate that the proposed approach not only reduces Hausdorff95 (HD95) by 33.9% and Average Surface Distance (ASD) by 42.1% compared with the state-of-the-art method, but it also achieves excellent results on the minute quantity of datasets, which greatly improves the feasibility of automatic root canal therapy evaluation by medical image computing.
IVMay 1, 2020
An Adaptive Enhancement Based Hybrid CNN Model for Digital Dental X-ray Positions ClassificationYaqi Wang, Lingling Sun, Yifang Zhang et al.
Analysis of dental radiographs is an important part of the diagnostic process in daily clinical practice. Interpretation by an expert includes teeth detection and numbering. In this project, a novel solution based on adaptive histogram equalization and convolution neural network (CNN) is proposed, which automatically performs the task for dental x-rays. In order to improve the detection accuracy, we propose three pre-processing techniques to supplement the baseline CNN based on some prior domain knowledge. Firstly, image sharpening and median filtering are used to remove impulse noise, and the edge is enhanced to some extent. Next, adaptive histogram equalization is used to overcome the problem of excessive amplification noise of HE. Finally, a multi-CNN hybrid model is proposed to classify six different locations of dental slices. The results showed that the accuracy and specificity of the test set exceeded 90\%, and the AUC reached 0.97. In addition, four dentists were invited to manually annotate the test data set (independently) and then compare it with the labels obtained by our proposed algorithm. The results show that our method can effectively identify the X-ray location of teeth.
IVMay 1, 2020
A cascade network for Detecting COVID-19 using chest x-raysDailin Lv, Wuteng Qi, Yunxiang Li et al.
The worldwide spread of pneumonia caused by a novel coronavirus poses an unprecedented challenge to the world's medical resources and prevention and control measures. Covid-19 attacks not only the lungs, making it difficult to breathe and life-threatening, but also the heart, kidneys, brain and other vital organs of the body, with possible sequela. At present, the detection of COVID-19 needs to be realized by the reverse transcription-polymerase Chain Reaction (RT-PCR). However, many countries are in the outbreak period of the epidemic, and the medical resources are very limited. They cannot provide sufficient numbers of gene sequence detection, and many patients may not be isolated and treated in time. Given this situation, we researched the analytical and diagnostic capabilities of deep learning on chest radiographs and proposed Cascade-SEMEnet which is cascaded with SEME-ResNet50 and SEME-DenseNet169. The two cascade networks of Cascade - SEMEnet both adopt large input sizes and SE-Structure and use MoEx and histogram equalization to enhance the data. We first used SEME-ResNet50 to screen chest X-ray and diagnosed three classes: normal, bacterial, and viral pneumonia. Then we used SEME-DenseNet169 for fine-grained classification of viral pneumonia and determined if it is caused by COVID-19. To exclude the influence of non-pathological features on the network, we preprocessed the data with U-Net during the training of SEME-DenseNet169. The results showed that our network achieved an accuracy of 85.6\% in determining the type of pneumonia infection and 97.1\% in the fine-grained classification of COVID-19. We used Grad-CAM to visualize the judgment based on the model and help doctors understand the chest radiograph while verifying the effectivene.
IVMar 3, 2020
DDU-Nets: Distributed Dense Model for 3D MRI Brain Tumor SegmentationHanxiao Zhang, Jingxiong Li, Mali Shen et al.
Segmentation of brain tumors and their subregions remains a challenging task due to their weak features and deformable shapes. In this paper, three patterns (cross-skip, skip-1 and skip-2) of distributed dense connections (DDCs) are proposed to enhance feature reuse and propagation of CNNs by constructing tunnels between key layers of the network. For better detecting and segmenting brain tumors from multi-modal 3D MR images, CNN-based models embedded with DDCs (DDU-Nets) are trained efficiently from pixel to pixel with a limited number of parameters. Postprocessing is then applied to refine the segmentation results by reducing the false-positive samples. The proposed method is evaluated on the BraTS 2019 dataset with results demonstrating the effectiveness of the DDU-Nets while requiring less computational cost.
CVAug 13, 2018
BACH: Grand Challenge on Breast Cancer Histology ImagesGuilherme Aresta, Teresa Araújo, Scotty Kwok et al.
Breast cancer is the most common invasive cancer in women, affecting more than 10% of women worldwide. Microscopic analysis of a biopsy remains one of the most important methods to diagnose the type of breast cancer. This requires specialized analysis by pathologists, in a task that i) is highly time- and cost-consuming and ii) often leads to nonconsensual results. The relevance and potential of automatic classification algorithms using hematoxylin-eosin stained histopathological images has already been demonstrated, but the reported results are still sub-optimal for clinical use. With the goal of advancing the state-of-the-art in automatic classification, the Grand Challenge on BreAst Cancer Histology images (BACH) was organized in conjunction with the 15th International Conference on Image Analysis and Recognition (ICIAR 2018). A large annotated dataset, composed of both microscopy and whole-slide images, was specifically compiled and made publicly available for the BACH challenge. Following a positive response from the scientific community, a total of 64 submissions, out of 677 registrations, effectively entered the competition. From the submitted algorithms it was possible to push forward the state-of-the-art in terms of accuracy (87%) in automatic classification of breast cancer with histopathological images. Convolutional neuronal networks were the most successful methodology in the BACH challenge. Detailed analysis of the collective results allowed the identification of remaining challenges in the field and recommendations for future developments. The BACH dataset remains publically available as to promote further improvements to the field of automatic classification in digital pathology.