IVSep 29, 2022
Diffusion Adversarial Representation Learning for Self-supervised Vessel SegmentationBoah Kim, Yujin Oh, Jong Chul Ye
Vessel segmentation in medical images is one of the important tasks in the diagnosis of vascular diseases and therapy planning. Although learning-based segmentation approaches have been extensively studied, a large amount of ground-truth labels are required in supervised methods and confusing background structures make neural networks hard to segment vessels in an unsupervised manner. To address this, here we introduce a novel diffusion adversarial representation learning (DARL) model that leverages a denoising diffusion probabilistic model with adversarial learning, and apply it to vessel segmentation. In particular, for self-supervised vessel segmentation, DARL learns the background signal using a diffusion module, which lets a generation module effectively provide vessel representations. Also, by adversarial learning based on the proposed switchable spatially-adaptive denormalization, our model estimates synthetic fake vessel images as well as vessel segmentation masks, which further makes the model capture vessel-relevant semantic information. Once the proposed model is trained, the model generates segmentation masks in a single step and can be applied to general vascular structure segmentation of coronary angiography and retinal images. Experimental results on various datasets show that our method significantly outperforms existing unsupervised and self-supervised vessel segmentation methods.
CVJan 28, 2023
ZegOT: Zero-shot Segmentation Through Optimal Transport of Text PromptsKwanyoung Kim, Yujin Oh, Jong Chul Ye
Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport. In particular, we introduce a novel Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an optimal mapping between multiple text prompts and visual feature maps of the frozen image encoder hidden layers. This unique mapping method facilitates each of the multiple text prompts to effectively focus on distinct visual semantic attributes. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
IVNov 3, 2023
LLM-driven Multimodal Target Volume Contouring in Radiation OncologyYujin Oh, Sangjoon Park, Hwa Kyung Byun et al.
Target volume contouring for radiation therapy is considered significantly more challenging than the normal organ segmentation tasks as it necessitates the utilization of both image and text-based clinical information. Inspired by the recent advancement of large language models (LLMs) that can facilitate the integration of the textural information and images, here we present a novel LLM-driven multimodal AI, namely LLMSeg, that utilizes the clinical text information and is applicable to the challenging task of target volume contouring for radiation therapy, and validate it within the context of breast cancer radiation therapy target volume contouring. Using external validation and data-insufficient environments, which attributes highly conducive to real-world applications, we demonstrate that the proposed model exhibits markedly improved performance compared to conventional unimodal AI models, particularly exhibiting robust generalization performance and data efficiency. To our best knowledge, this is the first LLM-driven multimodal AI model that integrates the clinical text information into target volume delineation for radiation oncology.
IVJul 31, 2023
C-DARL: Contrastive diffusion adversarial representation learning for label-free blood vessel segmentationBoah Kim, Yujin Oh, Bradford J. Wood et al.
Blood vessel segmentation in medical imaging is one of the essential steps for vascular disease diagnosis and interventional planning in a broad spectrum of clinical scenarios in image-based medicine and interventional medicine. Unfortunately, manual annotation of the vessel masks is challenging and resource-intensive due to subtle branches and complex structures. To overcome this issue, this paper presents a self-supervised vessel segmentation method, dubbed the contrastive diffusion adversarial representation learning (C-DARL) model. Our model is composed of a diffusion module and a generation module that learns the distribution of multi-domain blood vessel data by generating synthetic vessel images from diffusion latent. Moreover, we employ contrastive learning through a mask-based contrastive loss so that the model can learn more realistic vessel representations. To validate the efficacy, C-DARL is trained using various vessel datasets, including coronary angiograms, abdominal digital subtraction angiograms, and retinal imaging. Experimental results confirm that our model achieves performance improvement over baseline methods with noise robustness, suggesting the effectiveness of C-DARL for vessel segmentation.
IVSep 27, 2024
Mixture of Multicenter Experts in Multimodal AI for Debiased Radiotherapy Target DelineationYujin Oh, Sangjoon Park, Xiang Li et al.
Clinical decision-making reflects diverse strategies shaped by regional patient populations and institutional protocols. However, most existing medical artificial intelligence (AI) models are trained on highly prevalent data patterns, which reinforces biases and fails to capture the breadth of clinical expertise. Inspired by the recent advances in Mixture of Experts (MoE), we propose a Mixture of Multicenter Experts (MoME) framework to address AI bias in the medical domain without requiring data sharing across institutions. MoME integrates specialized expertise from diverse clinical strategies to enhance model generalizability and adaptability across medical centers. We validate this framework using a multimodal target volume delineation model for prostate cancer radiotherapy. With few-shot training that combines imaging and clinical notes from each center, the model outperformed baselines, particularly in settings with high inter-center variability or limited data availability. Furthermore, MoME enables model customization to local clinical preferences without cross-institutional data exchange, making it especially suitable for resource-constrained settings while promoting broadly generalizable medical AI.
CVSep 26, 2024
High-Fidelity 3D Lung CT Synthesis in ARDS Swine Models Using Score-Based 3D Residual Diffusion ModelsSiyeop Yoon, Yujin Oh, Xiang Li et al.
Acute respiratory distress syndrome (ARDS) is a severe condition characterized by lung inflammation and respiratory failure, with a high mortality rate of approximately 40%. Traditional imaging methods, such as chest X-rays, provide only two-dimensional views, limiting their effectiveness in fully assessing lung pathology. Three-dimensional (3D) computed tomography (CT) offers a more comprehensive visualization, enabling detailed analysis of lung aeration, atelectasis, and the effects of therapeutic interventions. However, the routine use of CT in ARDS management is constrained by practical challenges and risks associated with transporting critically ill patients to remote scanners. In this study, we synthesize high-fidelity 3D lung CT from 2D generated X-ray images with associated physiological parameters using a score-based 3D residual diffusion model. Our preliminary results demonstrate that this approach can produce high-quality 3D CT images that are validated with ground truth, offering a promising solution for enhancing ARDS management.
IVFeb 2, 2025Code
Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic PerspectiveYujin Oh, Pengfei Jin, Sangjoon Park et al.
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available. The source code is available at https://github.com/tvseg/dMoE.
33.4IVMay 12
Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MILHyun Do Jung, Jungwon Choi, Soojung Choi et al.
Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
39.8CVMay 11
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image SegmentationYiqi Tian, Sangjoon Park, Bo Zeng et al.
Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem \textbf{intra-group hidden failure}. To solve this, we propose \textbf{DuetFair} mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce \textbf{FairDRO}, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points ($\uparrow 6.0\%$) under the tumor-stage grouping and by 4.1 points ($\uparrow 7.4\%$) under the institution grouping over the strongest baseline.
CVMar 21, 2024
OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic SegmentationKwanyoung Kim, Yujin Oh, Jong Chul Ye
The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring muiltimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA) , which effectively replaces cross-attention mechanisms within Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
CVNov 23, 2024
Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain AugmentationJunhyeok Lee, Yujin Oh, Dahyoun Lee et al.
Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.
CVMar 4, 2025
Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional ImagingYujin Oh, Robert Seifert, Yihan Cao et al.
In oncology, Positron Emission Tomography-Computed Tomography (PET/CT) is widely used in cancer diagnosis, staging, and treatment monitoring, as it combines anatomical details from CT with functional metabolic activity and molecular marker expression information from PET. However, existing artificial intelligence-driven PET/CT analyses rely predominantly on task-specific models trained from scratch or on limited datasets, limiting their generalizability and robustness. To address this, we propose a foundation model approach specifically designed for multimodal PET/CT imaging. We introduce the Cross-Fraternal Twin Masked Autoencoder (FratMAE), a novel framework that effectively integrates whole-body anatomical and functional or molecular information. FratMAE employs separate Vision Transformer (ViT) encoders for PET and CT scans, along with cross-attention decoders that enable synergistic interactions between modalities during masked autoencoder training. Additionally, it incorporates textual metadata to enhance PET representation learning. By pre-training on PET/CT datasets, FratMAE captures intricate cross-modal relationships and global uptake patterns, achieving superior performance on downstream tasks and demonstrating its potential as a generalizable foundation model.
AIFeb 9
CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic PerspectiveYichen Wu, Yujin Oh, Sangjoon Park et al.
Recent multi-agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized LLM-agent framework in which specialists operate on partitioned evidence and coordinate through a game-theoretic objective for robust decision-making. In contrast to most agent architectures relying on stochastic narrative-based reasoning, CoMMa utilizes deterministic embedding projections to approximate contribution-aware credit assignment. This yields explicit evidence attribution by estimating each agent's marginal utility, producing interpretable and mathematically grounded decision pathways with improved stability. Evaluated on diverse oncology benchmarks, including a real-world multidisciplinary tumor board dataset, CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agents baselines.
IVMay 28, 2025
Surf2CT: Cascaded 3D Flow Matching Models for Torso 3D CT Synthesis from Skin SurfaceSiyeop Yoon, Yujin Oh, Pengfei Jin et al.
We present Surf2CT, a novel cascaded flow matching framework that synthesizes full 3D computed tomography (CT) volumes of the human torso from external surface scans and simple demographic data (age, sex, height, weight). This is the first approach capable of generating realistic volumetric internal anatomy images solely based on external body shape and demographics, without any internal imaging. Surf2CT proceeds through three sequential stages: (1) Surface Completion, reconstructing a complete signed distance function (SDF) from partial torso scans using conditional 3D flow matching; (2) Coarse CT Synthesis, generating a low-resolution CT volume from the completed SDF and demographic information; and (3) CT Super-Resolution, refining the coarse volume into a high-resolution CT via a patch-wise conditional flow model. Each stage utilizes a 3D-adapted EDM2 backbone trained via flow matching. We trained our model on a combined dataset of 3,198 torso CT scans (approximately 1.13 million axial slices) sourced from Massachusetts General Hospital (MGH) and the AutoPET challenge. Evaluation on 700 paired torso surface-CT cases demonstrated strong anatomical fidelity: organ volumes exhibited small mean percentage differences (range from -11.1% to 4.4%), and muscle/fat body composition metrics matched ground truth with strong correlation (range from 0.67 to 0.96). Lung localization had minimal bias (mean difference -2.5 mm), and surface completion significantly improved metrics (Chamfer distance: from 521.8 mm to 2.7 mm; Intersection-over-Union: from 0.87 to 0.98). Surf2CT establishes a new paradigm for non-invasive internal anatomical imaging using only external data, opening opportunities for home-based healthcare, preventive medicine, and personalized clinical assessments without the risks associated with conventional imaging techniques.
IVMay 28, 2025
Cascaded 3D Diffusion Models for Whole-body 3D 18-F FDG PET/CT synthesis from DemographicsSiyeop Yoon, Sifan Song, Pengfei Jin et al.
We propose a cascaded 3D diffusion model framework to synthesize high-fidelity 3D PET/CT volumes directly from demographic variables, addressing the growing need for realistic digital twins in oncologic imaging, virtual trials, and AI-driven data augmentation. Unlike deterministic phantoms, which rely on predefined anatomical and metabolic templates, our method employs a two-stage generative process. An initial score-based diffusion model synthesizes low-resolution PET/CT volumes from demographic variables alone, providing global anatomical structures and approximate metabolic activity. This is followed by a super-resolution residual diffusion model that refines spatial resolution. Our framework was trained on 18-F FDG PET/CT scans from the AutoPET dataset and evaluated using organ-wise volume and standardized uptake value (SUV) distributions, comparing synthetic and real data between demographic subgroups. The organ-wise comparison demonstrated strong concordance between synthetic and real images. In particular, most deviations in metabolic uptake values remained within 3-5% of the ground truth in subgroup analysis. These findings highlight the potential of cascaded 3D diffusion models to generate anatomically and metabolically accurate PET/CT images, offering a robust alternative to traditional phantoms and enabling scalable, population-informed synthetic imaging for clinical and research applications.
CVMay 8, 2025
OWT: A Foundational Organ-Wise Tokenization Framework for Medical ImagingSifan Song, Siyeop Yoon, Pengfei Jin et al.
Recent advances in representation learning often rely on holistic embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging, where downstream tasks depend on anatomically interpretable features. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while enabling fine-grained control for targeted clinical applications. Experiments on CT and MRI datasets demonstrate OWT's power: it not only achieves strong performance on standard tasks like image reconstruction and segmentation, but also unlocks novel, high-impact clinical capabilities including organ-specific tumor identification, organ-level retrieval and semantic-level generation, without requiring any additional training. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and a new perspective on how representations can be leveraged.
CVMar 18, 2025
MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven PromptsRunqi Meng, Sifan Song, Pengfei Jin et al.
Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
IVMar 6, 2025
Prediction of Frozen Region Growth in Kidney Cryoablation Intervention Using a 3D Flow-Matching ModelSiyeop Yoon, Yujin Oh, Matthew Tivnan et al.
This study presents a 3D flow-matching model designed to predict the progression of the frozen region (iceball) during kidney cryoablation. Precise intraoperative guidance is critical in cryoablation to ensure complete tumor eradication while preserving adjacent healthy tissue. However, conventional methods, typically based on physics driven or diffusion based simulations, are computationally demanding and often struggle to represent complex anatomical structures accurately. To address these limitations, our approach leverages intraoperative CT imaging to inform the model. The proposed 3D flow matching model is trained to learn a continuous deformation field that maps early-stage CT scans to future predictions. This transformation not only estimates the volumetric expansion of the iceball but also generates corresponding segmentation masks, effectively capturing spatial and morphological changes over time. Quantitative analysis highlights the model robustness, demonstrating strong agreement between predictions and ground-truth segmentations. The model achieves an Intersection over Union (IoU) score of 0.61 and a Dice coefficient of 0.75. By integrating real time CT imaging with advanced deep learning techniques, this approach has the potential to enhance intraoperative guidance in kidney cryoablation, improving procedural outcomes and advancing the field of minimally invasive surgery.
CVNov 27, 2023
End-to-End Breast Cancer Radiotherapy Planning via LMMs with Consistency EmbeddingKwanyoung Kim, Yujin Oh, Sangjoon Park et al.
Recent advances in AI foundation models have significant potential for lightening the clinical workload by mimicking the comprehensive and multi-faceted approaches used by medical professionals. In the field of radiation oncology, the integration of multiple modalities holds great importance, so the opportunity of foundational model is abundant. Inspired by this, here we present RO-LMM, a multi-purpose, comprehensive large multimodal model (LMM) tailored for the field of radiation oncology. This model effectively manages a series of tasks within the clinical workflow, including clinical context summarization, radiation treatment plan suggestion, and plan-guided target volume segmentation by leveraging the capabilities of LMM. In particular, to perform consecutive clinical tasks without error accumulation, we present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LMM's robustness to noisy inputs while preserving the consistency of handling clean inputs. We further extend this concept to LMM-driven segmentation framework, leading to a novel Consistency Embedding Segmentation (CESEG) techniques. Experimental results including multi-centre validation confirm that our RO-LMM with CEFTune and CESEG results in promising performance for multiple clinical tasks with generalization capabilities.
IVFeb 17, 2022
Multi-Scale Hybrid Vision Transformer for Learning Gastric Histology: AI-Based Decision Support System for Gastric Cancer TreatmentYujin Oh, Go Eun Bae, Kyung-Hee Kim et al.
Gastric endoscopic screening is an effective way to decide appropriate gastric cancer (GC) treatment at an early stage, reducing GC-associated mortality rate. Although artificial intelligence (AI) has brought a great promise to assist pathologist to screen digitalized whole slide images, existing AI systems are limited in fine-grained cancer subclassifications and have little usability in planning cancer treatment. We propose a practical AI system that enables five subclassifications of GC pathology, which can be directly matched to general GC treatment guidance. The AI system is designed to efficiently differentiate multi-classes of GC through multi-scale self-attention mechanism using 2-stage hybrid Vision Transformer (ViT) networks, by mimicking the way how human pathologists understand histology. The AI system demonstrates reliable diagnostic performance by achieving class-average sensitivity of above 0.85 on a total of 1,212 slides from multicentric cohort. Furthermore, AI-assisted pathologists show significantly improved diagnostic sensitivity by 12% in addition to 18% reduced screening time compared to human pathologists. Our results demonstrate that AI-assisted gastric endoscopic screening has a great potential for providing presumptive pathologic opinion and appropriate cancer treatment of gastric cancer in practical clinical settings.
IVFeb 13, 2022
AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillationSangjoon Park, Gwanghyun Kim, Yujin Oh et al.
Although deep learning-based computer-aided diagnosis systems have recently achieved expert-level performance, developing a robust deep learning model requires large, high-quality data with manual annotation, which is expensive to obtain. This situation poses the problem that the chest x-rays collected annually in hospitals cannot be used due to the lack of manual labeling by experts, especially in deprived areas. To address this, here we present a novel deep learning framework that uses knowledge distillation through self-supervised learning and self-training, which shows that the performance of the original model trained with a small number of labels can be gradually improved with more unlabeled data. Experimental results show that the proposed framework maintains impressive robustness against a real-world environment and has general applicability to several diagnostic tasks such as tuberculosis, pneumothorax, and COVID-19. Notably, we demonstrated that our model performs even better than those trained with the same amount of labeled data. The proposed framework has a great potential for medical imaging, where plenty of data is accumulated every year, but ground truth annotations are expensive to obtain.
IVApr 15, 2021
Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity QuantificationSangjoon Park, Gwanghyun Kim, Yujin Oh et al.
Developing a robust algorithm to diagnose and quantify the severity of COVID-19 using Chest X-ray (CXR) requires a large number of well-curated COVID-19 datasets, which is difficult to collect under the global COVID-19 pandemic. On the other hand, CXR data with other findings are abundant. This situation is ideally suited for the Vision Transformer (ViT) architecture, where a lot of unlabeled data can be used through structural modeling by the self-attention mechanism. However, the use of existing ViT is not optimal, since feature embedding through direct patch flattening or ResNet backbone in the standard ViT is not intended for CXR. To address this problem, here we propose a novel Vision Transformer that utilizes low-level CXR feature corpus obtained from a backbone network that extracts common CXR findings. Specifically, the backbone network is first trained with large public datasets to detect common abnormal findings such as consolidation, opacity, edema, etc. Then, the embedded features from the backbone network are used as corpora for a Transformer model for the diagnosis and the severity quantification of COVID-19. We evaluate our model on various external test datasets from totally different institutions to evaluate the generalization capability. The experimental results confirm that our model can achieve the state-of-the-art performance in both diagnosis and severity quantification tasks with superior generalization capability, which are sine qua non of widespread deployment.
IVApr 13, 2021
CXR Segmentation by AdaIN-based Domain Adaptation and Knowledge DistillationYujin Oh, Jong Chul Ye
As segmentation labels are scarce, extensive researches have been conducted to train segmentation networks with domain adaptation, semi-supervised or self-supervised learning techniques to utilize abundant unlabeled dataset. However, these approaches appear different from each other, so it is not clear how these approaches can be combined for better performance. Inspired by recent multi-domain image translation approaches, here we propose a novel segmentation framework using adaptive instance normalization (AdaIN), so that a single generator is trained to perform both domain adaptation and semi-supervised segmentation tasks via knowledge distillation by simply changing task-specific AdaIN codes. Specifically, our framework is designed to deal with difficult situations in chest X-ray radiograph (CXR) segmentation, where labels are only available for normal data, but the trained model should be applied to both normal and abnormal data. The proposed network demonstrates great generalizability under domain shift and achieves the state-of-the-art performance for abnormal CXR segmentation.
IVMar 12, 2021
Severity Quantification and Lesion Localization of COVID-19 on CXR using Vision TransformerGwanghyun Kim, Sangjoon Park, Yujin Oh et al.
Under the global pandemic of COVID-19, building an automated framework that quantifies the severity of COVID-19 and localizes the relevant lesion on chest X-ray images has become increasingly important. Although pixel-level lesion severity labels, e.g. lesion segmentation, can be the most excellent target to build a robust model, collecting enough data with such labels is difficult due to time and labor-intensive annotation tasks. Instead, array-based severity labeling that assigns integer scores on six subdivisions of lungs can be an alternative choice enabling the quick labeling. Several groups proposed deep learning algorithms that quantify the severity of COVID-19 using the array-based COVID-19 labels and localize the lesions with explainability maps. To further improve the accuracy and interpretability, here we propose a novel Vision Transformer tailored for both quantification of the severity and clinically applicable localization of the COVID-19 related lesions. Our model is trained in a weakly-supervised manner to generate the full probability maps from weak array-based labels. Furthermore, a novel progressive self-training method enables us to build a model with a small labeled dataset. The quantitative and qualitative analysis on the external testset demonstrates that our method shows comparable performance with radiologists for both tasks with stability in a real-world application.
IVMar 12, 2021
Vision Transformer for COVID-19 CXR Diagnosis using Chest X-ray Feature CorpusSangjoon Park, Gwanghyun Kim, Yujin Oh et al.
Under the global COVID-19 crisis, developing robust diagnosis algorithm for COVID-19 using CXR is hampered by the lack of the well-curated COVID-19 data set, although CXR data with other disease are abundant. This situation is suitable for vision transformer architecture that can exploit the abundant unlabeled data using pre-training. However, the direct use of existing vision transformer that uses the corpus generated by the ResNet is not optimal for correct feature embedding. To mitigate this problem, we propose a novel vision Transformer by using the low-level CXR feature corpus that are obtained to extract the abnormal CXR features. Specifically, the backbone network is trained using large public datasets to obtain the abnormal features in routine diagnosis such as consolidation, glass-grass opacity (GGO), etc. Then, the embedded features from the backbone network are used as corpus for vision transformer training. We examine our model on various external test datasets acquired from totally different institutions to assess the generalization ability. Our experiments demonstrate that our method achieved the state-of-art performance and has better generalization capability, which are crucial for a widespread deployment.
IVApr 13, 2020
Deep Learning COVID-19 Features on CXR using Limited Training Data SetsYujin Oh, Sangjoon Park, Jong Chul Ye
Under the global pandemic of COVID-19, the use of artificial intelligence to analyze chest X-ray (CXR) image for COVID-19 diagnosis and patient triage is becoming important. Unfortunately, due to the emergent nature of the COVID-19 pandemic, a systematic collection of the CXR data set for deep neural network training is difficult. To address this problem, here we propose a patch-based convolutional neural network approach with a relatively small number of trainable parameters for COVID-19 diagnosis. The proposed method is inspired by our statistical analysis of the potential imaging biomarkers of the CXR radiographs. Experimental results show that our method achieves state-of-the-art performance and provides clinically interpretable saliency maps, which are useful for COVID-19 diagnosis and patient triage.