Jichen Yang

CV
h-index46
19papers
1,307citations
Novelty30%
AI Score45

19 Papers

IVMar 16, 2022Code
Knee arthritis severity measurement using deep learning: a publicly available algorithm with a multi-institutional validation showing radiologist-level performance

Hanxue Gu, Keyu Li, Roy J. Colglazier et al.

The assessment of knee osteoarthritis (KOA) severity on knee X-rays is a central criteria for the use of total knee arthroplasty. However, this assessment suffers from imprecise standards and a remarkably high inter-reader variability. An algorithmic, automated assessment of KOA severity could improve overall outcomes of knee replacement procedures by increasing the appropriateness of its use. We propose a novel deep learning-based five-step algorithm to automatically grade KOA from posterior-anterior (PA) views of radiographs: (1) image preprocessing (2) localization of knees joints in the image using the YOLO v3-Tiny model, (3) initial assessment of the severity of osteoarthritis using a convolutional neural network-based classifier, (4) segmentation of the joints and calculation of the joint space narrowing (JSN), and (5), a combination of the JSN and the initial assessment to determine a final Kellgren-Lawrence (KL) score. Furthermore, by displaying the segmentation masks used to make the assessment, our algorithm demonstrates a higher degree of transparency compared to typical "black box" deep learning classifiers. We perform a comprehensive evaluation using two public datasets and one dataset from our institution, and show that our algorithm reaches state-of-the art performance. Moreover, we also collected ratings from multiple radiologists at our institution and showed that our algorithm performs at the radiologist level. The software has been made publicly available at https://github.com/MaciejMazurowski/osteoarthritis-classification.

CVAug 1, 2024Code
Segment anything model 2: an application to 2D and 3D medical images

Haoyu Dong, Hanxue Gu, Yaqian Chen et al.

Segment Anything Model (SAM) has gained significant attention because of its ability to segment various objects in images given a prompt. The recently developed SAM 2 has extended this ability to video inputs. This opens an opportunity to apply SAM to 3D images, one of the fundamental tasks in the medical imaging field. In this paper, we extensively evaluate SAM 2's ability to segment both 2D and 3D medical images by first collecting 21 medical imaging datasets, including surgical videos, common 3D modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) as well as 2D modalities such as X-ray and ultrasound. Two evaluation settings of SAM 2 are considered: (1) multi-frame 3D segmentation, where prompts are provided to one or multiple slice(s) selected from the volume, and (2) single-frame 2D segmentation, where prompts are provided to each slice. The former only applies to videos and 3D modalities, while the latter applies to all datasets. Our results show that SAM 2 exhibits similar performance as SAM under single-frame 2D segmentation, and has variable performance under multi-frame 3D segmentation depending on the choices of slices to annotate, the direction of the propagation, the predictions utilized during the propagation, etc. We believe our work enhances the understanding of SAM 2's behavior in the medical field and provides directions for future work in adapting SAM 2 to this domain. Our code is available at: https://github.com/mazurowski-lab/segment-anything2-medical-evaluation.

IVJul 25, 2022
Deep learning-based algorithm for assessment of knee osteoarthritis severity in radiographs matches performance of radiologists

Albert Swiecicki, Nianyi Li, Jonathan O'Donnell et al.

A fully-automated deep learning algorithm matched performance of radiologists in assessment of knee osteoarthritis severity in radiographs using the Kellgren-Lawrence grading system. To develop an automated deep learning-based algorithm that jointly uses Posterior-Anterior (PA) and Lateral (LAT) views of knee radiographs to assess knee osteoarthritis severity according to the Kellgren-Lawrence grading system. We used a dataset of 9739 exams from 2802 patients from Multicenter Osteoarthritis Study (MOST). The dataset was divided into a training set of 2040 patients, a validation set of 259 patients and a test set of 503 patients. A novel deep learning-based method was utilized for assessment of knee OA in two steps: (1) localization of knee joints in the images, (2) classification according to the KL grading system. Our method used both PA and LAT views as the input to the model. The scores generated by the algorithm were compared to the grades provided in the MOST dataset for the entire test set as well as grades provided by 5 radiologists at our institution for a subset of the test set. The model obtained a multi-class accuracy of 71.90% on the entire test set when compared to the ratings provided in the MOST dataset. The quadratic weighted Kappa coefficient for this set was 0.9066. The average quadratic weighted Kappa between all pairs of radiologists from our institution who took a part of study was 0.748. The average quadratic-weighted Kappa between the algorithm and the radiologists at our institution was 0.769. The proposed model performed demonstrated equivalency of KL classification to MSK radiologists, but clearly superior reproducibility. Our model also agreed with radiologists at our institution to the same extent as the radiologists with each other. The algorithm could be used to provide reproducible assessment of knee osteoarthritis severity.

CVApr 20, 2023
Segment Anything Model for Medical Image Analysis: an Experimental Study

Maciej A. Mazurowski, Haoyu Dong, Hanxue Gu et al.

Training segmentation models for medical images continues to be challenging due to the limited availability of data annotations. Segment Anything Model (SAM) is a foundation model that is intended to segment user-defined objects of interest in an interactive manner. While the performance on natural images is impressive, medical image domains pose their own set of challenges. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 19 medical imaging datasets from various modalities and anatomies. We report the following findings: (1) SAM's performance based on single prompts highly varies depending on the dataset and the task, from IoU=0.1135 for spine MRI to IoU=0.8650 for hip X-ray. (2) Segmentation performance appears to be better for well-circumscribed objects with prompts with less ambiguity and poorer in various other scenarios such as the segmentation of brain tumors. (3) SAM performs notably better with box prompts than with point prompts. (4) SAM outperforms similar methods RITM, SimpleClick, and FocalClick in almost all single-point prompt settings. (5) When multiple-point prompts are provided iteratively, SAM's performance generally improves only slightly while other methods' performance improves to the level that surpasses SAM's point-based performance. We also provide several illustrations for SAM's performance on all tested datasets, iterative segmentation, and SAM's behavior given prompt ambiguity. We conclude that SAM shows impressive zero-shot segmentation performance for certain medical imaging datasets, but moderate to poor performance for others. SAM has the potential to make a significant impact in automated medical image segmentation in medical imaging, but appropriate care needs to be applied when using it.

IVJul 27, 2022
Deep Learning for Classification of Thyroid Nodules on Ultrasound: Validation on an Independent Dataset

Jingxi Weng, Benjamin Wildman-Tobriner, Mateusz Buda et al.

Objectives: The purpose is to apply a previously validated deep learning algorithm to a new thyroid nodule ultrasound image dataset and compare its performances with radiologists. Methods: Prior study presented an algorithm which is able to detect thyroid nodules and then make malignancy classifications with two ultrasound images. A multi-task deep convolutional neural network was trained from 1278 nodules and originally tested with 99 separate nodules. The results were comparable with that of radiologists. The algorithm was further tested with 378 nodules imaged with ultrasound machines from different manufacturers and product types than the training cases. Four experienced radiologists were requested to evaluate the nodules for comparison with deep learning. Results: The Area Under Curve (AUC) of the deep learning algorithm and four radiologists were calculated with parametric, binormal estimation. For the deep learning algorithm, the AUC was 0.69 (95% CI: 0.64 - 0.75). The AUC of radiologists were 0.63 (95% CI: 0.59 - 0.67), 0.66 (95% CI:0.61 - 0.71), 0.65 (95% CI: 0.60 - 0.70), and 0.63 (95%CI: 0.58 - 0.67). Conclusion: In the new testing dataset, the deep learning algorithm achieved similar performances with all four radiologists. The relative performance difference between the algorithm and the radiologists is not significantly affected by the difference of ultrasound scanner.

CVApr 15, 2024Code
How to build the best medical image segmentation algorithm using foundation models: a comprehensive empirical study with Segment Anything Model

Hanxue Gu, Haoyu Dong, Jichen Yang et al.

Automated segmentation is a fundamental medical image analysis task, which enjoys significant advances due to the advent of deep learning. While foundation models have been useful in natural language processing and some vision tasks for some time, the foundation model developed with image segmentation in mind - Segment Anything Model (SAM) - has been developed only recently and has shown similar promise. However, there are still no systematic analyses or "best-practice" guidelines for optimal fine-tuning of SAM for medical image segmentation. This work summarizes existing fine-tuning strategies with various backbone architectures, model components, and fine-tuning algorithms across 18 combinations, and evaluates them on 17 datasets covering all common radiology modalities. Our study reveals that (1) fine-tuning SAM leads to slightly better performance than previous segmentation methods, (2) fine-tuning strategies that use parameter-efficient learning in both the encoder and decoder are superior to other strategies, (3) network architecture has a small impact on final performance, (4) further training SAM with self-supervised learning can improve final model performance. We also demonstrate the ineffectiveness of some methods popular in the literature and further expand our experiments into few-shot and prompt-based settings. Lastly, we released our code and MRI-specific fine-tuned weights, which consistently obtained superior performance over the original SAM, at https://github.com/mazurowski-lab/finetune-SAM.

CVNov 6, 2024Code
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang et al.

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.

IVJan 23, 2024Code
SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

Hanxue Gu, Roy Colglazier, Haoyu Dong et al.

Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation. Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone.

IVFeb 13, 2025
Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis

Yaqian Chen, Hanxue Gu, Yuwen Chen et al.

Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very limited publicly available tools that could be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis area in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. The model outperforms the benchmark by 2.40% on skeletal muscle and 10.26% on SAT compared to the manual annotations given by the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Furthermore, the model provided muscular fat segmentation with a Dice coefficient of 56.27%, which can be utilized for additional analyses as needed.

CVFeb 1
Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning

Jichen Yang, Jikai Zhang, Benjamin Wildman-Tobriner et al.

The diagnosis of thyroid nodule cancers commonly utilizes ultrasound images. Several studies showed that deep learning algorithms designed to classify benign and malignant thyroid nodules could match radiologists' performance. However, data availability for training deep learning models is often limited due to the significant effort required to curate such datasets. The previous study proposed a method to curate thyroid nodule datasets automatically. It was tested to have a 63% yield rate and 83% accuracy. However, the usefulness of the generated data for training deep learning models remains unknown. In this study, we conducted experiments to determine whether using a automatically-curated dataset improves deep learning algorithms' performance. We trained deep learning models on the manually annotated and automatically-curated datasets. We also trained with a smaller subset of the automatically-curated dataset that has higher accuracy to explore the optimum usage of such dataset. As a result, the deep learning model trained on the manually selected dataset has an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66). It is significantly lower than the AUC of the 6automatically-curated dataset trained deep learning model, 0.694 (95% confidence interval [CI]: 0.67, 0.73, P < .001). The AUC of the accurate subset trained deep learning model is 0.689 (95% confidence interval [CI]: 0.66, 0.72, P > .43), which is insignificantly worse than the AUC of the full automatically-curated dataset. In conclusion, we showed that using a automatically-curated dataset can substantially increase the performance of deep learning algorithms, and it is suggested to use all the data rather than only using the accurate subset.

IVJul 18, 2025
BreastSegNet: Multi-label Segmentation of Breast MRI

Qihang Li, Jichen Yang, Yaqian Chen et al.

Breast MRI provides high-resolution imaging critical for breast cancer screening and preoperative staging. However, existing segmentation methods for breast MRI remain limited in scope, often focusing on only a few anatomical structures, such as fibroglandular tissue or tumors, and do not cover the full range of tissues seen in scans. This narrows their utility for quantitative analysis. In this study, we present BreastSegNet, a multi-label segmentation algorithm for breast MRI that covers nine anatomical labels: fibroglandular tissue (FGT), vessel, muscle, bone, lesion, lymph node, heart, liver, and implant. We manually annotated a large set of 1123 MRI slices capturing these structures with detailed review and correction from an expert radiologist. Additionally, we benchmark nine segmentation models, including U-Net, SwinUNet, UNet++, SAM, MedSAM, and nnU-Net with multiple ResNet-based encoders. Among them, nnU-Net ResEncM achieves the highest average Dice scores of 0.694 across all labels. It performs especially well on heart, liver, muscle, FGT, and bone, with Dice scores exceeding 0.73, and approaching 0.90 for heart and liver. All model code and weights are publicly available, and we plan to release the data at a later date.

SPJun 18, 2025
SegmentAnyMuscle: A universal muscle segmentation model across different locations in MRI

Roy Colglazier, Jisoo Lee, Haoyu Dong et al.

The quantity and quality of muscles are increasingly recognized as important predictors of health outcomes. While MRI offers a valuable modality for such assessments, obtaining precise quantitative measurements of musculature remains challenging. This study aimed to develop a publicly available model for muscle segmentation in MRIs and demonstrate its applicability across various anatomical locations and imaging sequences. A total of 362 MRIs from 160 patients at a single tertiary center (Duke University Health System, 2016-2020) were included, with 316 MRIs from 114 patients used for model development. The model was tested on two separate sets: one with 28 MRIs representing common sequence types, achieving an average Dice Similarity Coefficient (DSC) of 88.45%, and another with 18 MRIs featuring less frequent sequences and abnormalities such as muscular atrophy, hardware, and significant noise, achieving 86.21% DSC. These results demonstrate the feasibility of a fully automated deep learning algorithm for segmenting muscles on MRI across diverse settings. The public release of this model enables consistent, reproducible research into the relationship between musculature and health.

SDJun 22, 2021
Key-Sparse Transformer for Multimodal Speech Emotion Recognition

Weidong Chen, Xiaofeng Xing, Xiangmin Xu et al.

Speech emotion recognition is a challenging research topic that plays a critical role in human-computer interaction. Multimodal inputs further improve the performance as more emotional information is used. However, existing studies learn all the information in the sample while only a small portion of it is about emotion. The redundant information will become noises and limit the system performance. In this paper, a key-sparse Transformer is proposed for efficient emotion recognition by focusing more on emotion related information. The proposed method is evaluated on the IEMOCAP and LSSED. Experimental results show that the proposed method achieves better performance than the state-of-the-art approaches.

CVJan 16, 2021
GridTracer: Automatic Mapping of Power Grids using Deep Learning and Overhead Imagery

Bohao Huang, Jichen Yang, Artem Streltsov et al.

Energy system information valuable for electricity access planning such as the locations and connectivity of electricity transmission and distribution towers, termed the power grid, is often incomplete, outdated, or altogether unavailable. Furthermore, conventional means for collecting this information is costly and limited. We propose to automatically map the grid in overhead remotely sensed imagery using deep learning. Towards this goal, we develop and publicly-release a large dataset ($263km^2$) of overhead imagery with ground truth for the power grid, to our knowledge this is the first dataset of its kind in the public domain. Additionally, we propose scoring metrics and baseline algorithms for two grid mapping tasks: (1) tower recognition and (2) power line interconnection (i.e., estimating a graph representation of the grid). We hope the availability of the training data, scoring metrics, and baselines will facilitate rapid progress on this important problem to help decision-makers address the energy needs of societies around the world.

ASOct 8, 2020
HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

Rohan Kumar Das, Ruijie Tao, Jichen Yang et al.

This work describes the speaker verification system developed by Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). The multimedia research has gained attention to a wide range of applications and speaker recognition is no exception to it. In contrast to the previous NIST SREs, the latest edition focuses on a multimedia track to recognize speakers with both audio and visual information. We developed separate systems for audio and visual inputs followed by a score level fusion of the systems from the two modalities to collectively use their information. The audio systems are based on x-vector based speaker embedding, whereas the face recognition systems are based on ResNet and InsightFace based face embeddings. With post evaluation studies and refinements, we obtain an equal error rate (EER) of 0.88% and an actual detection cost function (actDCF) of 0.026 on the evaluation set of 2019 NIST multimedia SRE corpus.

ASSep 9, 2020
Multi-modal Attention for Speech Emotion Recognition

Zexu Pan, Zhaojie Luo, Jichen Yang et al.

Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN) to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates the attention across three modalities and selectively fuse the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on IEMOCAP database for emotion recognition.

LGSep 11, 2019
PDA: Progressive Data Augmentation for General Robustness of Deep Neural Networks

Hang Yu, Aishan Liu, Xianglong Liu et al.

Adversarial images are designed to mislead deep neural networks (DNNs), attracting great attention in recent years. Although several defense strategies achieved encouraging robustness against adversarial samples, most of them fail to improve the robustness on common corruptions such as noise, blur, and weather/digital effects (e.g. frost, pixelate). To address this problem, we propose a simple yet effective method, named Progressive Data Augmentation (PDA), which enables general robustness of DNNs by progressively injecting diverse adversarial noises during training. In other words, DNNs trained with PDA are able to obtain more robustness against both adversarial attacks as well as common corruptions than the recent state-of-the-art methods. We also find that PDA is more efficient than prior arts and able to prevent accuracy drop on clean samples without being attacked. Furthermore, we theoretically show that PDA can control the perturbation bound and guarantee better generalization ability than existing work. Extensive experiments on many benchmarks such as CIFAR-10, SVHN, and ImageNet demonstrate that PDA significantly outperforms its counterparts in various experimental setups.

ASApr 16, 2019
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences

Kong Aik Lee, Ville Hautamaki, Tomi Kinnunen et al.

The I4U consortium was established to facilitate a joint entry to NIST speaker recognition evaluations (SRE). The latest edition of such joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of I4U consortium into NIST SRE series of evaluation. The primary objective of the current paper is to summarize the results and lessons learned based on the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progresses, and major paradigm shifts that we have witnessed as an SRE participant in the past decade from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a switch of research challenge from channel compensation to domain adaptation.

ASSep 17, 2018
Generative x-vectors for text-independent speaker verification

Longting Xu, Rohan Kumar Das, Emre Yılmaz et al.

Speaker verification (SV) systems using deep neural network embeddings, so-called the x-vector systems, are becoming popular due to its good performance superior to the i-vector systems. The fusion of these systems provides improved performance benefiting both from the discriminatively trained x-vectors and generative i-vectors capturing distinct speaker characteristics. In this paper, we propose a novel method to include the complementary information of i-vector and x-vector, that is called generative x-vector. The generative x-vector utilizes a transformation model learned from the i-vector and x-vector representations of the background data. Canonical correlation analysis is applied to derive this transformation model, which is later used to transform the standard x-vectors of the enrollment and test segments to the corresponding generative x-vectors. The SV experiments performed on the NIST SRE 2010 dataset demonstrate that the system using generative x-vectors provides considerably better performance than the baseline i-vector and x-vector systems. Furthermore, the generative x-vectors outperform the fusion of i-vector and x-vector systems for long-duration utterances, while yielding comparable results for short-duration utterances.