AI · Nov 5, 2022
Coarse-to-fine Knowledge Graph Domain Adaptation based on Distantly-supervised Iterative Training
Hongmin Cai, Wenxiong Liao, Zhengliang Liu et al. · harvard
Modern supervised learning neural network models require a large amount of manually labeled data, which makes the construction of domain-specific knowledge graphs time-consuming and labor-intensive. In parallel, although there has been much research on named entity recognition and relation extraction based on distantly supervised learning, constructing a domain-specific knowledge graph from large collections of textual data without manual annotations is still an urgent problem to be solved. In response, we propose an integrated framework for adapting and re-learning knowledge graphs from one coarse domain (biomedical) to a finer-defined domain (oncology). In this framework, we apply distant supervision to cross-domain knowledge graph adaptation. Consequently, no manual data annotation is required to train the model. We introduce a novel iterative training strategy to facilitate the discovery of domain-specific named entities and triples. Experimental results indicate that the proposed framework can perform domain adaptation and knowledge graph construction efficiently.
CV · Mar 10, 2022 · Code
Cascaded Sparse Feature Propagation Network for Interactive Segmentation
Chuyu Zhang, Chuanyang Hu, Hui Ren et al.
We aim to tackle the problem of point-based interactive segmentation, in which the key challenge is to propagate the user-provided annotations to unlabeled regions efficiently. Existing methods tackle this challenge by utilizing computationally expensive fully connected graphs or transformer architectures that sacrifice important fine-grained information required for accurate segmentation. To overcome these limitations, we propose a cascade sparse feature propagation network that learns a click-augmented feature representation for propagating user-provided information to unlabeled regions. The sparse design of our network enables efficient information propagation on high-resolution features, resulting in more detailed object segmentation. We validate the effectiveness of our method through comprehensive experiments on various benchmarks, and the results demonstrate the superior performance of our approach. Code is available at https://github.com/kleinzcy/CSFPN.
AS · Jul 5, 2023
Exploring Multimodal Approaches for Alzheimer's Disease Detection Using Patient Speech Transcript and Audio Data
Hongmin Cai, Xiaoke Huang, Zhengliang Liu et al.
Alzheimer's disease (AD) is a common form of dementia that severely impacts patient health. As AD impairs the patient's language understanding and expression ability, the speech of AD patients can serve as an indicator of this disease. This study investigates various methods for detecting AD using patients' speech and transcript data from the DementiaBank Pitt database. The proposed approach involves pre-trained language models and a Graph Neural Network (GNN) pipeline that constructs a graph from the speech transcript and extracts features with the GNN for AD detection. Data augmentation techniques, such as synonym replacement and a GPT-based augmenter, were used to address the small dataset size. Audio data was also introduced, and the WavLM model was used to extract audio features. These features were then fused with text features using various methods. Finally, a contrastive learning approach was attempted by converting speech transcripts back to audio and using it for contrastive learning with the original audio. We conducted intensive experiments and analysis on the above methods. Our findings shed light on the challenges and potential solutions in AD detection using speech and audio data.
CL · Jul 21, 2023
CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study
Zihan Guan, Zihao Wu, Zhengliang Liu et al.
Participant recruitment based on unstructured medical texts such as clinical notes and radiology reports has been a challenging yet important task for cohort establishment in clinical research. Recently, Large Language Models (LLMs) such as ChatGPT have achieved tremendous success in various downstream tasks thanks to their promising performance in language understanding, inference, and generation. It is then natural to test their feasibility in solving the cohort recruitment task, which involves the classification of a given paragraph of medical text into disease label(s). However, when applied to knowledge-intensive problem settings such as medical text classification, where the LLMs are expected to understand the decisions made by human experts and accurately identify the implied disease labels, the LLMs show mediocre performance. A possible explanation is that, by only using the medical text, the LLMs neglect to use the rich context of additional information that languages afford. To this end, we propose to use a knowledge graph as auxiliary information to guide the LLMs in making predictions. Moreover, to further help the LLMs adapt to the problem setting, we apply a chain-of-thought (CoT) sample selection strategy enhanced by reinforcement learning, which selects a set of CoT samples given each individual medical report. Experimental results and various ablation studies show that our few-shot learning method achieves satisfactory performance compared with fine-tuning strategies and gains superb advantages when the available data is limited. The code and sample dataset of the proposed CohortGPT model are available at: https://anonymous.4open.science/r/CohortGPT-4872/
CL · Apr 1, 2025 · Code
MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Juncheng Wu, Wenlong Deng, Xingxuan Li et al.
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code are available at https://github.com/UCSC-VLAA/MedReason.
CV · Feb 17
VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
Hui Ren, Yuval Alaluf, Omer Bar Tal et al.
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
CV · Jul 17, 2024
Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation
Ruijie Xu, Chuyu Zhang, Hui Ren et al.
We tackle novel class discovery in point cloud segmentation, which discovers novel classes based on the semantic knowledge of seen classes. Existing work proposes an online point-wise clustering method with a simplified equal class-size constraint on the novel classes to avoid degenerate solutions. However, the inherently imbalanced distribution of novel classes in point clouds typically violates the equal class-size constraint. Moreover, point-wise clustering ignores the rich spatial context information of objects, which results in a less expressive representation for semantic segmentation. To address the above challenges, we propose a novel self-labeling strategy that adaptively generates high-quality pseudo-labels for imbalanced classes during model training. In addition, we develop a dual-level representation that incorporates regional consistency into the point-level classifier learning, reducing noise in the generated segmentation. Finally, we conduct extensive experiments on two widely used datasets, SemanticKITTI and SemanticPOSS, and the results show our method outperforms the state of the art by a large margin.
LG · May 15
Learning Context-conditioned Gaussian Overbounds for Convolution-Based Uncertainty Propagation
Ruirui Liu, Xuejie Hou, Yiping Jiang et al.
Uncertainty quantification is essential in safety-critical settings, from autonomous driving to aviation, finance, and health, where decisions must rely on conservative bounds rather than point estimates. Predictor-level intervals (e.g., from quantile regression, conformal prediction, variance networks, or Bayesian models) generally do not compose: adding two per-variable intervals need not yield a valid interval for their sum or preserve coverage. In aviation, Gaussian overbounding replaces complex error distributions with a conservative Gaussian whose tails dominate the truth, so conservatism propagates through linear operations. Yet classical overbounds are global, often overly conservative, and hard to adapt to feature-conditioned errors. We propose a unified learning framework that trains neural networks to produce context-aware Gaussian overbounds (mean and scale) with provable conservatism on a finite quantile grid and, under three explicit regularity assumptions, continuous-tail conservatism on a certified interval. Our overbounding loss enforces conservativeness at selected quantiles while penalizing distributional distance with a Wasserstein-style term. The learned bounds support conservative linear-combination and convolution analysis on the enforced grid, and on the certified interval when assumptions hold, while being less redundant than traditional methods. We provide a scoped analysis of discrete-to-continuous conservatism and compact-domain objective regularity, and validate on synthetic data and real-world datasets, including multipath, ionospheric, and tropospheric residual errors. Across these settings, the method yields tighter bounds while maintaining conservatism on the enforced grid and in experiments. The framework is modality-agnostic and applicable to learning systems that require conservative, feature-conditioned uncertainty estimates in dynamic environments.
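The idea of conservatism on a finite quantile grid, and of combining Gaussian bounds through a linear operation, can be illustrated with a small sketch. This is not the learned, context-conditioned method from the abstract: it fits a single global zero-mean Gaussian, and the helper names (`empirical_quantile`, `overbound_sigma`, `combine_sigmas`), the zero-mean simplification, and the nearest-rank quantile rule are all assumptions made for illustration. Whether the combined bound is truly conservative depends on regularity conditions that this sketch does not check.

```python
import math
from statistics import NormalDist


def empirical_quantile(sorted_xs, p):
    """Nearest-rank empirical quantile of an already sorted sample."""
    k = min(len(sorted_xs) - 1, max(0, math.ceil(p * len(sorted_xs)) - 1))
    return sorted_xs[k]


def overbound_sigma(samples, levels):
    """Smallest sigma such that N(0, sigma^2) is at least as extreme as the
    sample at every quantile level in `levels` (levels must not equal 0.5).

    For p < 0.5 we need sigma * z_p <= q_p, for p > 0.5 we need
    sigma * z_p >= q_p. Both reduce to sigma >= q_p / z_p.
    """
    xs = sorted(samples)
    std_normal = NormalDist()
    sigma = 0.0
    for p in levels:
        z = std_normal.inv_cdf(p)
        q = empirical_quantile(xs, p)
        sigma = max(sigma, q / z)
    return sigma


def combine_sigmas(weights, sigmas):
    """Scale of a weighted sum of independent zero-mean Gaussians."""
    return math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sigmas)))


# Made-up residuals with a heavier left tail than right tail.
residuals = [-3.0, -1.2, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.5, 4.0]
grid = [0.1, 0.2, 0.8, 0.9]
sigma = overbound_sigma(residuals, grid)
print("grid-conservative sigma:", round(sigma, 3))
print("sigma of a sum of two such errors:", round(combine_sigmas([1, 1], [sigma, sigma]), 3))
```

The point of the sketch is the contrast drawn in the abstract: per-variable intervals have no general rule for combination, whereas Gaussian scales combine in closed form.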
CV · Jul 4, 2025 · Code
SAMed-2: Selective Memory Enhanced Medical Segment Anything Model
Zhiling Yan, Sifan Song, Dingjie Song et al.
Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: https://github.com/ZhilingYan/Medical-SAM-Bench.
CV · Jul 11, 2022
2nd Place Solution to Google Landmark Retrieval 2020
Min Yang, Cheng Cui, Xuetong Xue et al.
This paper presents the 2nd place solution to the Google Landmark Retrieval Competition 2020. We propose a method for training a global feature model for landmark retrieval that needs no post-processing such as local features or spatial verification. Our retrieval method in this competition has two parts. The training scheme mainly consists of increasing the margin value of the arcmargin loss and increasing the image resolution step by step. Models are trained with the PaddlePaddle and PyTorch frameworks and then converted to TensorFlow 2.2. Using this method, we obtained a public score of 0.40176 and a private score of 0.36278 and achieved 2nd place in the Google Landmark Retrieval Competition 2020.
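To make the "increasing margin" idea concrete, here is a minimal sketch of an additive angular margin applied to the target-class logit, together with a stepwise margin schedule. The abstract gives no margin values, scale, or schedule, so `arc_margin_logits`, `margin_at`, and every number below are hypothetical and should not be read as the competition settings.

```python
import math


def arc_margin_logits(cosines, target, margin, scale=30.0):
    """Scaled logits with an additive angular margin on the target class.

    `cosines` holds the cosine similarity between one embedding and each
    class weight vector. For the target class, cos(theta) is replaced by
    cos(theta + margin), which makes that class harder to satisfy.
    Note: this sketch does not handle theta + margin > pi, which real
    implementations usually treat with a fallback rule.
    """
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        if j == target:
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


def margin_at(stage, start=0.1, step=0.1, max_margin=0.5):
    """Hypothetical schedule: the margin grows by `step` per training stage."""
    return min(max_margin, start + step * stage)


cosines = [0.8, 0.1, -0.2]  # made-up similarities, target is class 0
for stage in range(4):
    m = margin_at(stage)
    print(stage, round(m, 2), [round(x, 2) for x in arc_margin_logits(cosines, 0, m)])
```

As the margin grows across stages, the target logit shrinks while the other logits stay fixed, so the model must pull embeddings closer to their class centers to keep the same loss.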
CV · Apr 4, 2024 · Code
SP$^2$OT: Semantic-Regularized Progressive Partial Optimal Transport for Imbalanced Clustering
Chuyu Zhang, Hui Ren, Xuming He
Deep clustering, which learns representation and semantic clustering without label information, poses a great challenge for deep learning-based approaches. Despite significant progress in recent years, most existing methods focus on uniformly distributed datasets, significantly limiting the practical applicability of their methods. In this paper, we propose a more practical problem setting named deep imbalanced clustering, where the underlying classes exhibit an imbalanced distribution. To address this challenge, we introduce a novel optimal transport-based pseudo-label learning framework. Our framework formulates pseudo-label generation as a Semantic-regularized Progressive Partial Optimal Transport (SP$^2$OT) problem, which progressively transports each sample to imbalanced clusters under prior and semantic relation constraints, thus generating high-quality and imbalance-aware pseudo-labels. To solve the SP$^2$OT problem, we propose a projected mirror descent algorithm, which alternates between: (1) computing the gradient of the SP$^2$OT objective, and (2) performing gradient descent with projection via an entropy-regularized progressive partial optimal transport formulation. Furthermore, we formulate the second step as an unbalanced optimal transport problem with augmented constraints and develop an efficient solution based on fast matrix scaling algorithms. Experiments on various datasets, including a human-curated long-tailed CIFAR100, challenging ImageNet-R, and large-scale subsets of fine-grained iNaturalist2018 datasets, demonstrate the superiority of our method. Code is available at https://github.com/rhfeiyang/SPPOT
CL · May 26, 2023 · Code
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
Kai Zhang, Rong Zhou, Eashan Adhikarla et al.
Traditional biomedical artificial intelligence (AI) models, designed for specific tasks or modalities, often exhibit limited flexibility in real-world deployment and struggle to utilize holistic information. Generalist AI holds the potential to address these limitations due to its versatility in interpreting different data types and generating tailored outputs for diverse needs. However, existing biomedical generalist AI solutions are typically heavyweight and closed source to researchers, practitioners, and patients. Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model, designed as a generalist capable of performing various biomedical tasks. BiomedGPT achieved state-of-the-art results in 16 out of 25 experiments while maintaining a computing-friendly model scale. We also conducted human evaluations to assess the capabilities of BiomedGPT in radiology visual question answering, report generation, and summarization. BiomedGPT exhibits robust prediction ability with a low error rate of 3.8% in question answering, satisfactory performance with an error rate of 8.3% in writing complex radiology reports, and competitive summarization ability with a nearly equivalent preference score to human experts. Our method demonstrates that effective training with diverse data can lead to more practical biomedical AI for improving diagnosis and workflow efficiency.
CV · Jun 10, 2019 · Code
2nd Place and 2nd Place Solution to Kaggle Landmark Recognition and Retrieval Competition 2019
Kaibing Chen, Cheng Cui, Yuning Du et al.
We present a retrieval-based system for the landmark retrieval and recognition challenges. The retrieval competition system has five parts, including feature extraction and matching to build a candidate queue; database augmentation and query extension search; and reranking from recognition results and local feature matching. The recognition challenge system includes landmark and non-landmark recognition, voting over multiple recognition results, and reranking using a combination of recognition and retrieval results. All models were trained and run for prediction with the PaddlePaddle framework. Using our method, we achieved 2nd place in the Google Landmark Recognition 2019 and 2nd place in the Google Landmark Retrieval 2019 on Kaggle. The source code is available here.
CV · May 8
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Hanqi Jiang, Junhao Chen, Yi Pan et al.
Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
CV · Mar 11, 2024
Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting
Wenting Chen, Pengyu Wang, Hui Ren et al.
Data scarcity and privacy concerns limit the availability of high-quality medical images for public use, which can be mitigated through medical image synthesis. However, current medical image synthesis methods often struggle to accurately capture the complexity of detailed anatomical structures and pathological conditions. To address these challenges, we propose a novel medical image synthesis model that leverages fine-grained image-text alignment and anatomy-pathology prompts to generate highly detailed and accurate synthetic medical images. Our method integrates advanced natural language processing techniques with image generative modeling, enabling precise alignment between descriptive text prompts and the synthesized images' anatomical and pathological details. The proposed approach consists of two key components: an anatomy-pathology prompting module and a fine-grained alignment-based synthesis module. The anatomy-pathology prompting module automatically generates descriptive prompts for high-quality medical images. To further synthesize high-quality medical images from the generated prompts, the fine-grained alignment-based synthesis module pre-defines a visual codebook for the radiology dataset and performs fine-grained alignment between the codebook and generated prompts to obtain key patches as visual clues, facilitating accurate image synthesis. We validate the superiority of our method through experiments on public chest X-ray datasets and demonstrate that our synthetic images preserve accurate semantic information, making them valuable for various medical applications.
CV · Jan 17, 2024
P$^2$OT: Progressive Partial Optimal Transport for Deep Imbalanced Clustering
Chuyu Zhang, Hui Ren, Xuming He
Deep clustering, which learns representation and semantic clustering without label information, poses a great challenge for deep learning-based approaches. Despite significant progress in recent years, most existing methods focus on uniformly distributed datasets, significantly limiting the practical applicability of their methods. In this paper, we first introduce a more practical problem setting named deep imbalanced clustering, where the underlying classes exhibit an imbalanced distribution. To tackle this problem, we propose a novel pseudo-labeling-based learning framework. Our framework formulates pseudo-label generation as a progressive partial optimal transport problem, which progressively transports each sample to imbalanced clusters under prior distribution constraints, thus generating imbalance-aware pseudo-labels and learning from high-confidence samples. In addition, we transform the initial formulation into an unbalanced optimal transport problem with augmented constraints, which can be solved efficiently by a fast matrix scaling algorithm. Experiments on various datasets, including a human-curated long-tailed CIFAR100, challenging ImageNet-R, and large-scale subsets of fine-grained iNaturalist2018 datasets, demonstrate the superiority of our method.
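For readers unfamiliar with matrix scaling in optimal transport, the sketch below shows the textbook Sinkhorn iteration used to turn model predictions into soft pseudo-labels whose cluster sizes match a given prior. It is standard entropy-regularized balanced OT, not the progressive partial formulation proposed in the abstract; the function name, the cost choice (negative log-probability), `epsilon`, and the toy numbers are all assumptions for illustration.

```python
import math


def sinkhorn_pseudo_labels(log_probs, cluster_prior, epsilon=0.5, n_iters=200):
    """Entropy-regularized OT between samples and clusters.

    Cost is C[i][j] = -log_probs[i][j], so the Gibbs kernel exp(-C / epsilon)
    equals exp(log_probs[i][j] / epsilon). Rows are scaled to mass 1/n each and
    columns to `cluster_prior`, by alternating diagonal rescaling.
    """
    n, k = len(log_probs), len(cluster_prior)
    kernel = [[math.exp(lp / epsilon) for lp in row] for row in log_probs]
    row_mass = [1.0 / n] * n
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Rescale rows, then columns, to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [cluster_prior[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Made-up predictions for 4 samples and 2 clusters, with an imbalanced prior.
probs = [[0.6, 0.4], [0.7, 0.3], [0.55, 0.45], [0.2, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
prior = [0.75, 0.25]

plan = sinkhorn_pseudo_labels(log_probs, prior)
labels = [max(range(len(prior)), key=lambda j: row[j]) for row in plan]
print(labels)
```

Each row of `plan`, multiplied by the number of samples, can be read as a soft pseudo-label. Fixing the column marginals to the prior is what prevents all samples from collapsing into one cluster.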
CV · Mar 26, 2025
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
Shijie Zhou, Hui Ren, Yijia Weng et al.
Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.
CV · Oct 30, 2024
EchoFM: Foundation Model for Generalizable Echocardiogram Analysis
Sekeun Kim, Pengfei Jin, Sifan Song et al.
Foundation models have recently gained significant attention because of their generalizability and adaptability across multiple tasks and data distributions. Although medical foundation models have emerged, solutions for cardiac imaging, especially echocardiography videos, are still unexplored. In this paper, we introduce EchoFM, a foundation model specifically designed to represent and analyze echocardiography videos. In EchoFM, we propose a self-supervised learning framework that captures both spatial and temporal variability patterns through a spatio-temporal consistent masking strategy and periodic-driven contrastive learning. This framework can effectively capture the spatio-temporal dynamics of echocardiography and learn the representative video features without any labels. We pre-train our model on an extensive dataset comprising over 290,000 echocardiography videos covering 26 scan views across different imaging modes, with up to 20 million frames of images. The pre-trained EchoFM can then be easily adapted and fine-tuned for a variety of downstream tasks, serving as a robust backbone model. Our evaluation was systematically designed for four downstream tasks after the echocardiography examination routine. Experimental results show that EchoFM surpasses state-of-the-art methods, including specialized echocardiography methods, self-supervised pre-training models, and general-purpose pre-trained foundation models, across all downstream tasks.
IV · Mar 11, 2024
Conditional Score-Based Diffusion Model for Cortical Thickness Trajectory Prediction
Qing Xiao, Siyeop Yoon, Hui Ren et al.
Alzheimer's Disease (AD) is a neurodegenerative condition characterized by diverse progression rates among individuals, with changes in cortical thickness (CTh) closely linked to its progression. Accurately forecasting CTh trajectories can significantly enhance early diagnosis and intervention strategies, providing timely care. However, the longitudinal data essential for these studies often suffer from temporal sparsity and incompleteness, presenting substantial challenges in modeling the disease's progression accurately. Existing methods are limited, focusing primarily on datasets without missing entries or requiring predefined assumptions about CTh progression. To overcome these obstacles, we propose a conditional score-based diffusion model specifically designed to generate CTh trajectories with the given baseline information, such as age, sex, and initial diagnosis. Our conditional diffusion model utilizes all available data during the training phase to make predictions based solely on baseline information during inference without needing prior history about CTh progression. The prediction accuracy of the proposed CTh prediction pipeline using a conditional score-based model was compared for sub-groups consisting of cognitively normal, mild cognitive impairment, and AD subjects. The Bland-Altman analysis shows our diffusion-based prediction model has a near-zero bias with a narrow 95% confidence interval compared to the ground-truth CTh in 6-36 months. In addition, because our conditional diffusion model has a stochastic generative nature, we demonstrated an uncertainty analysis of patient-specific CTh prediction through multiple realizations.
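As a reminder of what a Bland-Altman analysis reports, the sketch below computes the bias (mean difference between predicted and reference values) and the conventional 95% limits of agreement, bias ± 1.96 times the standard deviation of the differences. The thickness values are made up, the function name is hypothetical, and the interval the authors report may be defined differently from the limits of agreement computed here.

```python
from statistics import mean, stdev


def bland_altman(predicted, reference):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Made-up cortical thickness values in millimetres, for illustration only.
predicted_cth = [2.1, 2.4, 2.0, 2.6]
reference_cth = [2.0, 2.5, 2.1, 2.5]

bias, lower, upper = bland_altman(predicted_cth, reference_cth)
print(f"bias={bias:.3f} mm, limits of agreement=({lower:.3f}, {upper:.3f}) mm")
```

A bias near zero means the predictions are not systematically high or low, and narrow limits mean individual predictions stay close to the reference.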
CV · Nov 29, 2024
Opt-In Art: Learning Art Styles Only from Few Examples
Hui Ren, Joanna Materzynska, Rohit Gandikota et al.
We explore whether pre-training on datasets with paintings is necessary for a model to learn an artistic style with only a few examples. To investigate this, we train a text-to-image model exclusively on photographs, without access to any painting-related content. We show that it is possible to adapt a model that is trained without paintings to an artistic style, given only a few examples. User studies and automatic evaluations confirm that our model (post-adaptation) performs on par with state-of-the-art models trained on massive datasets that contain artistic content like paintings, drawings, or illustrations. Finally, using data attribution techniques, we analyze how both artistic and non-artistic datasets contribute to generating artistic-style images. Surprisingly, our findings suggest that high-quality artistic outputs can be achieved without prior exposure to artistic data, indicating that artistic style generation can occur in a controlled, opt-in manner using only a limited, carefully selected set of training examples.
CV · Jun 19, 2024
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
Hejie Cui, Lingjun Mao, Xin Liang et al.
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at BioMed-VITAL.github.io.
CV · Apr 9, 2024
Prompt-driven Universal Model for View-Agnostic Echocardiography Analysis
Sekeun Kim, Hui Ren, Peng Guo et al.
Echocardiography segmentation for cardiac analysis is time-consuming and resource-intensive due to the variability in image quality and the necessity to process scans from various standard views. While current automated segmentation methods in echocardiography show promising performance, they are trained on specific scan views to analyze corresponding data. This approach is limited because the number of required models increases with the number of standard views. To address this, we present a prompt-driven universal method for view-agnostic echocardiography analysis. Considering the domain shift between standard views, we first introduce a method called prompt matching, which learns prompts specific to different views by matching prompts and querying input embeddings using a pre-trained vision model. Then, we use a pre-trained medical language model to align textual information with pixel data for accurate segmentation. Extensive experiments on three standard views show that our approach significantly outperforms state-of-the-art universal methods and achieves performance comparable to, or even better than, that of segmentation models trained and tested on the same views.
IV · Mar 15, 2024
Temporal-spatial Adaptation of Promptable SAM Enhance Accuracy and Generalizability of cine CMR Segmentation
Zhennong Chen, Sekeun Kim, Hui Ren et al.
Accurate myocardium segmentation across all phases of one cardiac cycle in cine cardiac magnetic resonance (CMR) scans is crucial for comprehensive cardiac function analysis. Despite advancements in deep learning (DL) for automatic cine CMR segmentation, generalizability to unseen data remains a significant challenge. Recently, the segment-anything-model (SAM) has been introduced as a segmentation foundation model, known for its accurate segmentation and, more importantly, zero-shot generalization. SAM was trained on two-dimensional (2D) natural images; to adapt it for comprehensive cine CMR segmentation, we propose cineCMR-SAM, which incorporates both temporal and spatial information through a modified model architecture. Compared to other state-of-the-art (SOTA) methods, our model achieved superior data-specific segmentation accuracy on the STACOM2011 dataset when fine-tuned on it, and demonstrated superior zero-shot generalization on two other large public datasets (ACDC and M&Ms) unseen during fine-tuning. Additionally, we introduced a text prompt feature in cineCMR-SAM to specify the view type of input slices (short-axis or long-axis), enhancing performance across all view types.
LG · Mar 21, 2021
Development and Validation of a Deep Learning Model for Prediction of Severe Outcomes in Suspected COVID-19 Infection
Varun Buch, Aoxiao Zhong, Xiang Li et al.
Triaging COVID-19 patients by predicting their outcomes upon first presentation to the emergency department (ED) is crucial for improving patient prognosis, as well as for better hospital resource management and cross-infection control. We trained a deep feature fusion model to predict patient outcomes, where the model inputs were electronic health record (EHR) data, including demographic information, co-morbidities, vital signs and laboratory measurements, plus the patient's chest X-ray (CXR) images. The model output was the patient outcome, defined as the most intensive oxygen therapy required. For patients without CXR images, we employed a Random Forest method for the prediction. Predictive risk scores for COVID-19 severe outcomes ("CO-RISK" score) were derived from the model output and evaluated on the testing dataset, as well as compared to human performance. The study's dataset (the "MGB COVID Cohort") was constructed from all patients presenting to the Mass General Brigham (MGB) healthcare system from March 1st to June 1st, 2020. ED visits with incomplete or erroneous data were excluded. Patients with no test order for COVID or with confirmed negative test results were excluded. Patients under the age of 15 were also excluded. In total, EHR data from 11060 COVID-19 confirmed or suspected patients were used in this study. CXR images were also collected from each patient if available. Results show that the CO-RISK score achieved an area under the curve (AUC) of 0.95 for predicting MV/death (i.e. severe outcomes) within 24 hours, and 0.92 within 72 hours, on the testing dataset. The model shows superior performance to the risk scores commonly used in the ED (CURB-65 and MEWS). Compared with physicians' decisions, the CO-RISK score demonstrated superior performance in making ICU/floor decisions.
IV · Nov 26, 2020
Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19
Aoxiao Zhong, Xiang Li, Dufan Wu et al.
In recent years, deep learning-based image analysis methods have been widely applied in computer-aided detection, diagnosis and prognosis, and have shown their value during the public health crisis of the novel coronavirus disease 2019 (COVID-19) pandemic. Chest radiograph (CXR) has been playing a crucial role in COVID-19 patient triaging, diagnosing and monitoring, particularly in the United States. Considering the mixed and unspecific signals in CXR, an image retrieval model of CXR that provides both similar images and associated clinical information can be more clinically meaningful than a direct image diagnostic model. In this work we develop a novel CXR image retrieval model based on deep metric learning. Unlike traditional diagnostic models, which aim at learning the direct mapping from images to labels, the proposed model aims at learning an optimized embedding space of images, where images with the same labels and similar contents are pulled together. It uses a multi-similarity loss with a hard-mining sampling strategy and an attention mechanism to learn the optimized embedding space, and returns images similar to the query image. The model is trained and validated on an international multi-site COVID-19 dataset collected from 3 different sources. Experimental results on COVID-19 image retrieval and diagnosis tasks show that the proposed model can serve as a robust solution for CXR analysis and patient management for COVID-19. The model is also tested for its transferability to a different clinical decision support task, where the pre-trained model is applied to extract image features from a new dataset without any further training. These results demonstrate that our deep metric learning-based image retrieval model is highly efficient in CXR retrieval, diagnosis and prognosis, and thus has great clinical value for the treatment and management of COVID-19 patients.
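The retrieval step described above, returning gallery images close to a query in the learned embedding space, can be illustrated with a minimal sketch. It assumes embeddings have already been produced by some trained model; the cosine similarity measure, the majority-vote diagnosis, and the names `retrieve` and `vote_label` are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, gallery, k=3):
    """Return the k gallery items most similar to the query embedding.

    gallery is a list of (embedding, label) pairs; in practice each item
    could also carry associated clinical information (assumption).
    """
    ranked = sorted(gallery, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]

def vote_label(neighbors):
    """Majority label among retrieved neighbors (hypothetical diagnosis rule)."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy 2-D embeddings standing in for model outputs.
gallery = [([1.0, 0.0], "covid"), ([0.9, 0.1], "covid"), ([0.0, 1.0], "normal")]
neighbors = retrieve([1.0, 0.05], gallery, k=2)
print([label for _, label in neighbors], vote_label(neighbors))
```

Because retrieval only needs embeddings, the same routine can be reused on a new dataset by swapping the gallery, which mirrors the transfer setting the abstract mentions.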
IV · Sep 26, 2020
Deep Learning-based Four-region Lung Segmentation in Chest Radiography for COVID-19 Diagnosis
Young-Gon Kim, Kyungsang Kim, Dufan Wu et al.
Purpose. Imaging plays an important role in assessing the severity of COVID-19 pneumonia. However, semantic interpretation of chest radiography (CXR) findings does not include a quantitative description of radiographic opacities. Most current AI-assisted CXR image analysis frameworks do not quantify regional variations of disease. To address these issues, we propose a four-region lung segmentation method to assist accurate quantification of COVID-19 pneumonia. Methods. A segmentation model is first applied to separate the left and right lungs; a detection network then locates the carina and left hilum, the clinical landmarks used to separate the upper and lower lungs. To improve segmentation performance on COVID-19 images, an ensemble strategy incorporating five models is used. Using each region, we evaluated the clinical relevance of the proposed method with the Radiographic Assessment of Lung Edema (RALE). Results. The proposed ensemble strategy achieved a Dice score of 0.900, which is significantly higher than that of conventional methods (0.854-0.889). Mean intensities of the four segmented regions correlate positively with the extent and density scores of pulmonary opacities under the RALE framework. Conclusion. A deep learning-based model in CXR can accurately segment and quantify the regional distribution of pulmonary opacities in patients with COVID-19 pneumonia.
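For readers unfamiliar with the two quantities reported above, the following minimal sketch computes a Dice score between two binary masks and the mean image intensity inside a region mask. Masks and images are plain nested lists here; the function names are illustrative and this is not the authors' evaluation code.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks (nested lists of 0/1)."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def mean_intensity(image, mask):
    """Mean pixel value of `image` over the pixels where `mask` is 1."""
    values = [px for img_row, mask_row in zip(image, mask)
              for px, m in zip(img_row, mask_row) if m]
    return sum(values) / len(values) if values else 0.0

pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
image = [[10, 20], [30, 40]]
print(dice_score(pred, truth))       # overlap of predicted vs. reference mask
print(mean_intensity(image, pred))   # average intensity inside the region
```

Computing `mean_intensity` once per lung region gives one value for each of the four regions, which is the kind of per-region summary the abstract relates to RALE scores.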
MED-PH · May 19, 2020
Self-supervised Dynamic CT Perfusion Image Denoising with Deep Neural Networks
Dufan Wu, Hui Ren, Quanzheng Li
Dynamic computed tomography perfusion (CTP) imaging is a promising approach for acute ischemic stroke diagnosis and evaluation. Hemodynamic parametric maps of cerebral parenchyma are calculated from repeated CT scans of the first pass of iodinated contrast through the brain. Because the repeated scans cause high radiation exposure, the dose of CTP must be reduced for routine applications, and image denoising is then necessary to achieve a reliable diagnosis. In this paper, we propose a self-supervised deep learning method for CTP denoising, which does not require any high-dose reference images for training. The network was trained by mapping each frame of CTP to an estimation from its adjacent frames. Because the noise in the source and target was independent, this approach could effectively remove the noise. Being free from high-dose training images makes the proposed method easier to adapt to different scanning protocols. The method was validated on both simulated data and a public real dataset. The proposed method achieved improved image quality compared to conventional denoising methods. On the real data, the proposed method also had improved spatial resolution and contrast-to-noise ratio compared to supervised learning trained on the simulation data.
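The training scheme described above pairs each frame with an estimate built from its neighbors, so no clean reference is needed. A minimal sketch of how such pairs might be assembled is shown below. The plain average of the previous and next frame is an assumption made for illustration (the paper's actual estimator may differ), and `adjacent_frame_pairs` is a hypothetical helper name.

```python
def adjacent_frame_pairs(frames):
    """Build (source, target) training pairs from a time series of frames.

    Each frame is a flat list of pixel values. The source is frame t and the
    target is the average of frames t-1 and t+1 (assumed estimator). Noise in
    the source and in the target comes from different scans, so it is
    independent between the two.
    """
    pairs = []
    for t in range(1, len(frames) - 1):
        prev_frame, next_frame = frames[t - 1], frames[t + 1]
        target = [(a + b) / 2 for a, b in zip(prev_frame, next_frame)]
        pairs.append((frames[t], target))
    return pairs

# Four toy frames of two pixels each.
frames = [[0, 0], [1, 2], [2, 4], [3, 6]]
for source, target in adjacent_frame_pairs(frames):
    print(source, target)
```

The first and last frames have only one neighbor and are skipped in this sketch; a real pipeline would need some policy for those boundary frames.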
CV · Dec 12, 2019
Zooming into Face Forensics: A Pixel-level Analysis
Jia Li, Tong Shen, Wei Zhang et al.
The stunning progress in face manipulation methods has made it possible to synthesize realistic fake face images, which poses potential threats to our society. Face forensics techniques that can distinguish such tampered images are urgently needed. A large-scale dataset, "FaceForensics++", has provided enormous training data generated from prominent face manipulation methods to facilitate anti-fake research. However, previous works mostly cast the task as a classification problem, considering only a global prediction. By investigating the problem, we find that training a classification network often fails to capture high-quality features, which might lead to sub-optimal solutions. In this paper, we zoom in on the problem by conducting a pixel-level analysis, i.e. formulating it as a pixel-level segmentation task. By evaluating multiple architectures on both segmentation and classification tasks, we show the superiority of viewing the problem from a segmentation perspective. Ablation studies are also performed to investigate what makes an effective and efficient anti-fake model. Strong baselines are also established, which, we hope, could shed some light on the field of face forensics.
CV · Aug 6, 2018
Multi-Estimator Full Left Ventricle Quantification through Ensemble Learning
Jiasha Liu, Xiang Li, Hui Ren et al.
Cardiovascular disease accounts for 1 in every 4 deaths in the United States. Accurate estimation of structural and functional cardiac parameters is crucial for both diagnosis and disease management. In this work, we develop an ensemble learning framework for more accurate and robust left ventricle (LV) quantification. The framework combines two 1st-level modules: a direct estimation module and a segmentation module. The direct estimation module uses a convolutional neural network (CNN) to achieve end-to-end quantification. The CNN is trained with 2D cardiac images as input and cardiac parameters as output. The segmentation module uses a U-Net architecture to obtain pixel-wise predictions that separate the epicardium and endocardium of the LV from the background. The binary U-Net output is then analyzed by a separate CNN to estimate the cardiac parameters. We then employ linear regression between the 1st-level predictions and the ground truth to learn a 2nd-level predictor that ensembles the results from the 1st-level modules for the final estimation. Preliminary results from testing the proposed framework on the LVQuan18 dataset show superior performance of the ensemble learning model over the two base modules.
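The 2nd-level step described above, a linear regression that combines the two 1st-level predictions, can be sketched as follows. This is a minimal illustration for a single scalar cardiac parameter using ordinary least squares with a bias term; the helper names (`solve`, `fit_stacker`, `predict`) and the toy numbers are assumptions, not the authors' code or data.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_stacker(direct, seg, truth):
    """Least-squares weights (bias, w_direct, w_seg) via the normal equations."""
    X = [[1.0, d, s] for d, s in zip(direct, seg)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(X, truth)) for i in range(3)]
    return solve(XtX, Xty)

def predict(weights, d, s):
    """Final ensemble estimate from the two 1st-level predictions."""
    return weights[0] + weights[1] * d + weights[2] * s

# Toy predictions from the direct-estimation and segmentation modules.
direct = [1.0, 2.0, 3.0, 4.0]
seg = [2.0, 1.0, 4.0, 3.0]
truth = [2.5, 2.5, 4.5, 4.5]
w = fit_stacker(direct, seg, truth)
print(w, predict(w, 5.0, 6.0))
```

Fitting the 2nd-level weights on held-out predictions rather than on the 1st-level training data is a common precaution in stacking, since it keeps the combiner from favoring an overfitted base module.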