Olivier Gevaert

h-index: 15

27 papers

927 citations

Novelty: 43%

AI Score: 56

Ranked #22,557 of 201,326 authors (top 11%) · #5,160 in LG (top 12%)

27 Papers

IV · Mar 15, 2022

Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set

Roxana Daneshjou, Kailas Vodrahalli, Roberto A Novoa et al. · Stanford

Access to dermatological care is a major issue, with an estimated 3 billion people lacking access to care globally. Artificial intelligence (AI) may aid in triaging skin diseases. However, most AI models have not been rigorously assessed on images of diverse skin tones or uncommon diseases. To ascertain potential biases in algorithm performance in this context, we curated the Diverse Dermatology Images (DDI) dataset, the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. Using this dataset of 656 images, we show that state-of-the-art dermatology AI models perform substantially worse on DDI, with the area under the receiver operating characteristic curve (ROC-AUC) dropping by 27-36 percent compared to the models' original test results. All the models performed worse on dark skin tones and uncommon diseases, which are represented in the DDI dataset. Additionally, we find that dermatologists, who typically provide visual labels for AI training and test datasets, also perform worse on images of dark skin tones and uncommon diseases compared to ground truth biopsy annotations. Finally, fine-tuning AI models on the well-characterized and diverse DDI images closed the performance gap between light and dark skin tones. Moreover, algorithms fine-tuned on diverse skin tones outperformed dermatologists on identifying malignancy on images of dark skin tones. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and diseases.

44.1 · AI · May 30

SDR: Set-Distance Rewards for Radiology Report Generation

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge et al.

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1, with average relative improvements of 6.80%, 7.82% and 4.45%, respectively). The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
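To make the idea of a permutation-invariant set-to-set reward concrete, here is a minimal sketch. The abstract does not say which set distance the authors use, so the symmetric Chamfer distance over cosine distances is an assumption chosen for illustration, as are the function names and the toy 2-D "sentence embeddings" (a real pipeline would obtain these from a frozen sentence transformer).

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty sets of vectors.

    Each vector is matched to its nearest neighbour in the other set, so the
    result does not depend on the order of sentences within either report.
    """
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)

def set_distance_reward(generated, reference):
    """Hypothetical reward: smaller set distance gives a larger reward."""
    return 1.0 - chamfer_distance(generated, reference)

# Toy embeddings (assumed): two findings per report, listed in different orders.
reference = [[1.0, 0.0], [0.0, 1.0]]
same_findings_reordered = [[0.0, 2.0], [3.0, 0.0]]
missing_one_finding = [[1.0, 0.0]]

print(set_distance_reward(same_findings_reordered, reference))
print(set_distance_reward(missing_one_finding, reference))
```

Because every sentence is matched to its nearest counterpart, reordering findings leaves the reward unchanged, while dropping a finding lowers it. That is the property exact-match rewards lack for reports made of unordered findings.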

IV · Jul 5, 2024 · Code

Unraveling Radiomics Complexity: Strategies for Optimal Simplicity in Predictive Modeling

Mahdi Ait Lhaj Loutfi, Teodora Boblea Podasca, Alex Zwanenburg et al.

Background: The high dimensionality of radiomic feature sets, the variability in radiomic feature types and potentially high computational requirements all underscore the need for an effective method to identify the smallest set of predictive features for a given clinical problem. Purpose: Develop a methodology and tools to identify and explain the smallest set of predictive radiomic features. Materials and Methods: 89,714 radiomic features were extracted from five cancer datasets: low-grade glioma, meningioma, non-small cell lung cancer (NSCLC), and two renal cell carcinoma cohorts (n=2104). Features were categorized by computational complexity into morphological, intensity, texture, linear filters, and nonlinear filters. Models were trained and evaluated on each complexity level using the area under the curve (AUC). The most informative features were identified, and their importance was explained. The optimal complexity level and associated most informative features were identified using systematic statistical significance analyses and a false discovery avoidance procedure, respectively. Their predictive importance was explained using a novel tree-based method. Results: MEDimage, a new open-source tool, was developed to facilitate radiomic studies. Morphological features were optimal for MRI-based meningioma (AUC: 0.65) and low-grade glioma (AUC: 0.68). Intensity features were optimal for CECT-based renal cell carcinoma (AUC: 0.82) and CT-based NSCLC (AUC: 0.76). Texture features were optimal for MRI-based renal cell carcinoma (AUC: 0.72). Tuning the Hounsfield unit range improved results for CECT-based renal cell carcinoma (AUC: 0.86). Conclusion: Our proposed methodology and software can estimate the optimal radiomics complexity level for specific medical outcomes, potentially simplifying the use of radiomics in predictive modeling across various contexts.

IR · Mar 9, 2022

Filter Drug-induced Liver Injury Literature with Natural Language Processing and Ensemble Learning

Xianghao Zhan, Fanjin Wang, Olivier Gevaert

Drug-induced liver injury (DILI) describes the adverse effects of drugs that damage the liver. Life-threatening outcomes, including liver failure and death, have also been reported in severe DILI cases. Therefore, DILI-related events are strictly monitored for all approved drugs, and liver toxicity has become an important assessment for new drug candidates. These DILI-related reports are documented in hospital records, in clinical trial results, and also in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, data extraction from previous publications relies heavily on resource-demanding manual labelling, which considerably decreases the efficiency of the information extraction process. The recent development of artificial intelligence, particularly the rise of natural language processing (NLP) techniques, has enabled the automatic processing of biomedical texts. In this study, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, we benchmarked model performance on filtering DILI literature. Among four word vectorization techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed the others, with an accuracy of 0.957 on our in-house test set. Furthermore, an ensemble model with similar overall performance was implemented and fine-tuned to reduce false negatives, to avoid overlooking potential DILI reports. The ensemble model achieved a high accuracy of 0.954 and an F1 score of 0.955 on the hold-out validation data provided by the CAMDA committee. Moreover, important words in positive/negative predictions were identified via model interpretation. Overall, the ensemble model reached satisfactory classification results and can be used by researchers to rapidly filter DILI-related literature.
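The TF-IDF plus logistic regression baseline named in this abstract can be sketched in plain Python. This is an illustrative toy, not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the SGD hyperparameters and the four example titles are all assumptions, and the ensemble model is not reproduced.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in doc_freq.items()}

def tfidf(doc, idf):
    """L2-normalized sparse TF-IDF vector (dict); unseen tokens are ignored."""
    tf = Counter(tok for tok in doc.lower().split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights[t] * w for t, w in vec.items()))
            err = p - y  # gradient of the log loss with respect to the logit
            bias -= lr * err
            for t, w in vec.items():
                weights[t] -= lr * err * w
    return weights, bias

def predict_proba(doc, idf, weights, bias):
    vec = tfidf(doc, idf)
    return sigmoid(bias + sum(weights[t] * w for t, w in vec.items()))

# Toy corpus (assumed): 1 = DILI-related, 0 = unrelated.
train_docs = [
    "drug induced liver injury after acetaminophen overdose",
    "hepatotoxicity and liver failure linked to drug exposure",
    "deep learning for chest x-ray classification",
    "kinematics of head impacts in football",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_docs)
weights, bias = train_logreg([tfidf(d, idf) for d in train_docs], train_labels)
print(predict_proba("liver injury caused by a new drug", idf, weights, bias))
print(predict_proba("head impacts in football games", idf, weights, bias))
```

One benefit of a linear model on TF-IDF features is that the learned weights map directly to words, which is in keeping with the abstract's remark that important words in positive and negative predictions can be identified.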

LG · Nov 4, 2023

Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects

Elisa Warner, Joonsang Lee, William Hsu et al.

Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models. This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems. Emphasizing challenges and innovations in addressing multimodal representation, fusion, translation, alignment, and co-learning, the paper explores the transformative potential of multimodal models for clinical predictions. It also highlights the need for principled assessments and practical implementation of such models, bringing attention to the dynamics between decision support systems and healthcare providers and personnel. Despite advancements, challenges such as data biases and the scarcity of "big data" in many biomedical domains persist. We conclude with a discussion on principled innovation and collaborative efforts to further the mission of seamless integration of multimodal ML models into biomedical practice.

LG · Dec 19, 2022

Denoising instrumented mouthguard measurements of head impact kinematics with a convolutional neural network

Xianghao Zhan, Yuzhe Liu, Nicholas J. Cecchi et al.

Wearable sensors for measuring head kinematics can be noisy due to imperfect interfaces with the body. Mouthguards are used to measure head kinematics during impacts in traumatic brain injury (TBI) studies, but deviations from reference kinematics can still occur due to potential looseness. In this study, deep learning is used to compensate for the imperfect interface and improve measurement accuracy. A set of one-dimensional convolutional neural network (1D-CNN) models was developed to denoise mouthguard kinematics measurements along three spatial axes of linear acceleration and angular velocity. The denoised kinematics had significantly reduced errors compared to reference kinematics, and reduced errors in brain injury criteria and tissue strain and strain rate calculated via finite element modeling. The 1D-CNN models were also tested on an on-field dataset of college football impacts and a post-mortem human subject dataset, with similar denoising effects observed. The models can be used to improve detection of head impacts and TBI risk evaluation, and potentially extended to other sensors measuring kinematics.

86.1 · LG · Apr 10 · Code

Improving Medical VQA through Trajectory-Aware Process Supervision

Halil Ibrahim Gulluk, Olivier Gevaert

Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/
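The trajectory-aware reward described in this abstract rests on Dynamic Time Warping between two sequences of step embeddings. The sketch below shows the classic DTW recurrence in plain Python. The Euclidean step distance, the `1 / (1 + distance)` mapping to a reward, and the toy 2-D embeddings are assumptions made for illustration; the abstract does not specify these details.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Classic O(n*m) dynamic time warping cost between two vector sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i steps of seq_a
    # with the first j steps of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # seq_a advances, seq_b step is reused
                cost[i][j - 1],      # seq_b advances, seq_a step is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

def process_reward(generated_steps, reference_steps):
    """Hypothetical reward in (0, 1]: identical trajectories score 1.0."""
    return 1.0 / (1.0 + dtw_distance(generated_steps, reference_steps))

# Toy step embeddings (assumed).
reference = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
repeated_step = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
reversed_order = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]

print(dtw_distance(repeated_step, reference))
print(dtw_distance(reversed_order, reference))
```

Unlike a set distance, DTW is order-sensitive: a trajectory that restates a step aligns at no extra cost, while one that visits the same steps in reverse order is penalized. That fits reasoning chains where the sequence of steps matters.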

LG · Jun 8, 2023

Toward more accurate and generalizable brain deformation estimators for traumatic brain injury detection with unsupervised domain adaptation

Xianghao Zhan, Jiawei Sun, Yuzhe Liu et al.

Machine learning head models (MLHMs) are developed to estimate brain deformation for early detection of traumatic brain injury (TBI). However, overfitting to simulated impacts and the lack of generalizability caused by distributional shift across head impact datasets hinder the broad clinical application of current MLHMs. We propose brain deformation estimators that integrate unsupervised domain adaptation with a deep neural network to predict whole-brain maximum principal strain (MPS) and MPS rate (MPSR). With 12,780 simulated head impacts, we performed unsupervised domain adaptation on on-field head impacts from 302 college football (CF) impacts and 457 mixed martial arts (MMA) impacts using domain regularized component analysis (DRCA) and cycle-GAN-based methods. The new model improved the MPS/MPSR estimation accuracy, with the DRCA method significantly outperforming other domain adaptation methods in prediction accuracy (p<0.001): MPS RMSE: 0.027 (CF) and 0.037 (MMA); MPSR RMSE: 7.159 (CF) and 13.022 (MMA). On another two hold-out test sets with 195 college football impacts and 260 boxing impacts, the DRCA model significantly outperformed the baseline model without domain adaptation in MPS and MPSR estimation accuracy (p<0.001). The DRCA domain adaptation reduces the MPS/MPSR estimation error to well below TBI thresholds, enabling accurate brain deformation estimation to detect TBI in future clinical applications.

26.7 · CV · May 19 · Code

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Halil Ibrahim Gulluk, Olivier Gevaert

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
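As a rough illustration of contrastive training on image-text pairs, the following sketch computes a CLIP-style symmetric cross-entropy loss over a small batch. The abstract only says "contrastive learning", so the specific loss form, the temperature value of 0.07 and the toy embeddings are assumptions. In the described setup the vectors would come from a vision encoder and a PubMedBERT text encoder.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair k (image k, caption k) is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy embeddings (assumed): captions either match their images or are swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[0.9, 0.1], [0.1, 0.9]]
swapped_captions = [[0.1, 0.9], [0.9, 0.1]]

print(contrastive_loss(images, matched_captions))
print(contrastive_loss(images, swapped_captions))
```

Minimizing this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is how caption information can reach the vision encoder before it is fine-tuned on labels.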

LG · Sep 13, 2023

Reliability-based cleaning of noisy training labels with inductive conformal prediction in multi-modal biomedical data mining

Xianghao Zhan, Qinmei Xu, Yuanning Zheng et al.

Accurately labeling biomedical data presents a challenge. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method employing inductive conformal prediction (ICP). This method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers within vast quantities of noisy training data. The efficacy of the method is validated across three classification tasks within distinct modalities: filtering drug-induced-liver-injury (DILI) literature with title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise to the training labels were introduced through label permutation. Results show significant enhancements in classification performance: accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4%), AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% and 69.8%), and accuracy and macro-average F1 score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% and 89.0%). Our method offers the potential to substantially boost classification performance in multi-modal biomedical machine learning tasks. Importantly, it accomplishes this without necessitating an excessive volume of meticulously curated training data.
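The core quantity in inductive conformal prediction is a p-value that compares a new example's nonconformity score against scores from a clean calibration set. The sketch below shows that computation and one possible relabeling rule. The rule itself, the `epsilon` threshold, the use of `1 - predicted probability` as the nonconformity score and all toy numbers are assumptions for illustration; the abstract does not spell out the exact reliability metrics the authors use.

```python
def icp_p_value(calibration_scores, test_score):
    """Share of calibration examples at least as nonconforming as the test one."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)

def label_p_values(probabilities, calibration):
    """p-value of each candidate label for one sample.

    probabilities: {label: model probability for this sample}
    calibration:   {label: nonconformity scores of clean samples with that label}
    Nonconformity is taken as 1 - probability (an assumed, common choice).
    """
    return {
        label: icp_p_value(calibration[label], 1.0 - prob)
        for label, prob in probabilities.items()
    }

def clean_label(noisy_label, probabilities, calibration, epsilon=0.25):
    """Hypothetical rule: relabel only when the given label is implausible
    (p-value below epsilon) and another label is plausible."""
    pvals = label_p_values(probabilities, calibration)
    best = max(pvals, key=pvals.get)
    if pvals[noisy_label] < epsilon and pvals[best] >= epsilon:
        return best
    return noisy_label

# Toy calibration scores and model output (assumed).
calibration = {
    "DILI": [0.05, 0.12, 0.20, 0.30],
    "non-DILI": [0.10, 0.15, 0.25, 0.40],
}
sample_probabilities = {"DILI": 0.92, "non-DILI": 0.08}

print(label_p_values(sample_probabilities, calibration))
print(clean_label("non-DILI", sample_probabilities, calibration))
```

The design point worth noting is that only the small, accurately labeled calibration set needs to be trusted; the large noisy set is judged against it sample by sample.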

CL · Sep 21, 2023

Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI

Mahyar Abbasian, Elahe Khatibi, Iman Azimi et al.

Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially augment patient health outcomes, all the while mitigating the workload burden on healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we engage in a discussion concerning the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.

SP · Sep 12, 2024

Identification of head impact locations, speeds, and force based on head kinematics

Xianghao Zhan, Yuzhe Liu, Nicholas J. Cecchi et al.

Objective: Head impact information, including impact direction, speed, and force, is important for studying traumatic brain injury and for designing and evaluating protective gear. This study presents a deep learning model developed to accurately predict head impact information, including location, speed, orientation, and force, based on head kinematics during helmeted impacts. Methods: Leveraging a dataset of 16,000 simulated helmeted head impacts using the Riddell helmet finite element model, we implemented a Long Short-Term Memory (LSTM) network to process the head kinematics: tri-axial linear accelerations and angular velocities. Results: The models accurately predict the impact parameters describing impact location, direction, speed, and the impact force profile, with R² exceeding 70% for all tasks. Further validation was conducted using an on-field dataset recorded by instrumented mouthguards and videos, consisting of 79 head impacts in which the impact location can be clearly identified. The deep learning model significantly outperformed existing methods, achieving 79.7% accuracy in identifying impact locations, compared with at most 49.4% for existing methods. Conclusion: The precision underscores the model's potential in enhancing helmet design and safety in sports by providing more accurate impact data. Future studies should test the models across various helmets and sports on large in vivo datasets to validate their accuracy, employing techniques like transfer learning to broaden their effectiveness.

LG · Nov 21, 2023

Towards a more inductive world for drug repurposing approaches

Jesus de la Fuente, Guillermo Serrano, Uxía Veleiro et al.

Drug-target interaction (DTI) prediction is a challenging, albeit essential, task in drug repurposing. Graph learning models have drawn special attention, as they can significantly reduce drug repurposing costs and time commitment. However, many current approaches require demanding additional information besides DTIs, which complicates their evaluation and usability. Additionally, structural differences in the learning architecture of current models hinder their fair benchmarking. In this work, we first perform an in-depth evaluation of current DTI datasets and prediction models through a robust benchmarking process, and show that DTI prediction methods based on transductive models lack generalization and lead to inflated performance when evaluated as previously done in the literature, and hence are not suited for drug repurposing approaches. We then propose a novel biologically-driven strategy for negative edge subsampling and show through in vitro validation that newly discovered interactions are indeed true. We envision this work as the underpinning for future fair benchmarking and robust model design. All generated resources and tools are publicly available as a Python package.

67.4 · LG · Apr 10

SemEnrich: Self-Supervised Semantic Enrichment of Radiology Reports for Vision-Language Learning

Halil Ibrahim Gulluk, Olivier Gevaert

Medical vision-language datasets are often limited in size and biased toward negative findings, because clinicians mostly report abnormalities and may omit positive or neutral findings that they consider irrelevant to the patient's condition. We propose a self-supervised data enrichment method that leverages semantic clustering of report sentences. We then enrich the findings in the training-set medical reports by adding positive/neutral observations from different clusters in a self-supervised manner. Our approach yields consistent gains in supervised fine-tuning (average gains of 5.63%, 3.04%, 7.40%, 5.30% and 7.47% on COMET, BERTScore, sentence BLEU, CheXbert-F1 and RadGraph-F1, respectively). Ablation studies confirm that improvements stem from semantic clustering rather than random augmentation. Furthermore, we introduce a way to incorporate semantic cluster information into the reward design for GRPO training, which leads to further performance gains (average gains of 2.78%, 3.14% and 12.80% on COMET, BERTScore and sentence BLEU, respectively). We share our code at https://anonymous.4open.science/r/SemEnrich-75CF

CV · Jan 23, 2025

Prior Knowledge Injection into Deep Learning Models Predicting Gene Expression from Whole Slide Images

Max Hallemeesch, Marija Pizurica, Paloma Rabaey et al.

Cancer diagnosis and prognosis primarily depend on clinical parameters such as age and tumor grade, and are increasingly complemented by molecular data, such as gene expression, from tumor sequencing. However, sequencing is costly and delays oncology workflows. Recent advances in Deep Learning make it possible to predict molecular information from morphological features within Whole Slide Images (WSIs), offering a cost-effective proxy for the molecular markers. While promising, current methods lack the robustness to fully replace direct sequencing. Here we aim to improve existing methods by introducing a model-agnostic framework that injects prior knowledge on gene-gene interactions into Deep Learning architectures, thereby increasing accuracy and robustness. We design the framework to be generic and flexibly adaptable to a wide range of architectures. In a case study on breast cancer, our strategy leads to an average increase of 983 significant genes (out of 25,761) across all 18 experiments, with 14 generalizing to an increase on an independent dataset. Our findings reveal a high potential for injection of prior knowledge to increase gene expression prediction performance from WSIs across a wide range of architectures.

LG · Jan 21

SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model

Xianghao Zhan, Jingyu Xu, Yuanning Zheng et al.

Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.

IV · May 21, 2025

Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets

Qinmei Xu, Yiheng Li, Xianghao Zhan et al.

Foundation models leveraging vision-language pretraining have shown promise in chest X-ray (CXR) interpretation, yet their real-world performance across diverse populations and diagnostic tasks remains insufficiently evaluated. This study benchmarks the diagnostic performance and generalizability of foundation models versus traditional convolutional neural networks (CNNs) on multinational CXR datasets. We evaluated eight CXR diagnostic models - five vision-language foundation models and three CNN-based architectures - across 37 standardized classification tasks using six public datasets from the USA, Spain, India, and Vietnam, and three private datasets from hospitals in China. Performance was assessed using AUROC, AUPRC, and other metrics across both shared and dataset-specific tasks. Foundation models outperformed CNNs in both accuracy and task coverage. MAVL, a model incorporating knowledge-enhanced prompts and structured supervision, achieved the highest performance on public (mean AUROC: 0.82; AUPRC: 0.32) and private (mean AUROC: 0.95; AUPRC: 0.89) datasets, ranking first in 14 of 37 public and 3 of 4 private tasks. All models showed reduced performance on pediatric cases, with average AUROC dropping from 0.88 +/- 0.18 in adults to 0.57 +/- 0.29 in children (p = 0.0202). These findings highlight the value of structured supervision and prompt design in radiologic AI and suggest future directions including geographic expansion and ensemble modeling for clinical deployment. Code for all evaluated models is available at https://drive.google.com/drive/folders/1B99yMQm7bB4h1sVMIBja0RfUu8gLktCE

IV · May 28, 2025

Comparative Analysis of Machine Learning Models for Lung Cancer Mutation Detection and Staging Using 3D CT Scans

Yiheng Li, Francisco Carrillo-Perez, Mohammed Alawad et al.

Lung cancer is the leading cause of cancer mortality worldwide, and non-invasive methods for detecting key mutations and staging are essential for improving patient outcomes. Here, we compare the performance of two machine learning models - FMCIB+XGBoost, a supervised model with domain-specific pretraining, and Dinov2+ABMIL, a self-supervised model with attention-based multiple-instance learning - on 3D lung nodule data from the Stanford Radiogenomics and Lung-CT-PT-Dx cohorts. In the task of KRAS and EGFR mutation detection, FMCIB+XGBoost consistently outperformed Dinov2+ABMIL, achieving accuracies of 0.846 and 0.883 for KRAS and EGFR mutations, respectively. In cancer staging, Dinov2+ABMIL demonstrated competitive generalization, achieving an accuracy of 0.797 for T-stage prediction in the Lung-CT-PT-Dx cohort, suggesting SSL's adaptability across diverse datasets. Our results emphasize the clinical utility of supervised models in mutation detection and highlight the potential of SSL to improve staging generalization, while identifying areas for enhancement in mutation sensitivity.

IV · Nov 15, 2021

Disparities in Dermatology AI: Assessments Using Diverse Clinical Images

Roxana Daneshjou, Kailas Vodrahalli, Weixin Liang et al.

More than 3 billion people lack access to care for skin disease. AI diagnostic tools may aid in early skin cancer detection; however, most models have not been assessed on images of diverse skin tones or uncommon diseases. To address this, we curated the Diverse Dermatology Images (DDI) dataset - the first publicly available, pathologically confirmed images featuring diverse skin tones. We show that state-of-the-art dermatology AI models perform substantially worse on DDI, with ROC-AUC dropping 29-40 percent compared to the models' original results. We find that dark skin tones and uncommon diseases, which are well represented in the DDI dataset, lead to performance drop-offs. Additionally, we show that state-of-the-art robust training methods cannot correct for these biases without diverse training data. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and across all diseases.

QM · Oct 27, 2021

Data-driven decomposition of brain dynamics with principal component analysis in different types of head impacts

Xianghao Zhan, Yuzhe Liu, Nicholas J. Cecchi et al.

Strain and strain rate are effective traumatic brain injury predictors. Kinematics-based models estimating these metrics suffer from significantly different distributions of both kinematics and the injury metrics across head impact types. To address this, previous studies focus on the kinematics but not the injury metrics. We have previously shown that the kinematic features vary largely across head impact types, resulting in different patterns of brain deformation. This study analyzes the spatial distribution of brain deformation and applies principal component analysis (PCA) to extract the representative patterns of injury metrics (maximum principal strain (MPS), MPS rate (MPSR) and MPS×MPSR) in four impact types (simulation, football, mixed martial arts and car crashes). We apply PCA to decompose the patterns of the injury metrics for all impacts in each impact type, and investigate the distributions among brain regions using the first principal component (PC1). Furthermore, we developed a deep learning head model (DLHM) to predict PC1 and then inverse-transform to predict for all brain elements. PC1 explained >80% variance on the datasets. Based on PC1 coefficients, the corpus callosum and midbrain exhibit high variance on all datasets. We found MPS×MPSR to be the most sensitive metric, on which the top 5% of severe impacts deviate further from the mean and there is a higher variance among the severe impacts. Finally, the DLHM reached mean absolute errors of <0.018 for MPS, <3.7 (1/s) for MPSR and <1.1 (1/s) for MPS×MPSR, much smaller than the injury thresholds. The brain injury metric in a dataset can be decomposed into mean components and PC1 with high explained variance. The brain dynamics decomposition enables better interpretation of the patterns in brain injury metrics and the sensitivity of brain injury metrics across impact types. The decomposition also reduces the dimensionality of DLHM.
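The statement that PC1 explains most of the variance can be made concrete with a small calculation. The sketch below estimates the first principal component by power iteration on the sample covariance matrix and reports the fraction of total variance it explains. It is a stdlib-only toy under stated assumptions: the function names and data are invented for illustration, each row stands in for one impact's per-element metric values, and the data are assumed to have non-zero variance. The DLHM itself is not reproduced.

```python
import math

def covariance(rows):
    """Sample covariance matrix (features x features) of mean-centered rows."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    d = len(means)
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # assumed start; must not be orthogonal to PC1
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue

def pc1_explained_variance_ratio(rows):
    """Variance along PC1 divided by total variance (trace of the covariance)."""
    cov = covariance(rows)
    _, eigenvalue = first_principal_component(cov)
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Toy data (assumed): one pattern scaled per impact, versus no dominant pattern.
one_dominant_pattern = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
no_dominant_pattern = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(pc1_explained_variance_ratio(one_dominant_pattern))
print(pc1_explained_variance_ratio(no_dominant_pattern))
```

When every impact is a scaled copy of one spatial pattern, PC1 captures essentially all the variance, so predicting a single PC1 score and inverse-transforming recovers the full field. The closer real data are to that regime, the more a model's output dimension can be reduced.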

LGAug 31, 2021

Rapidly and accurately estimating brain strain and strain rate across head impact types with transfer learning and data fusion

Xianghao Zhan, Yuzhe Liu, Nicholas J. Cecchi et al.

Brain strain and strain rate are effective in predicting traumatic brain injury (TBI) caused by head impacts. However, state-of-the-art finite element modeling (FEM) demands considerable computational time, limiting its application in real-time TBI risk monitoring. To accelerate the computation, machine learning head models (MLHMs) were developed, but their accuracy was found to decrease when the training and test datasets were from different head impact types. Moreover, the dataset for a specific impact type may be too small for model training. To address the computational cost of FEM, the limited strain rate prediction, and the generalizability of MLHMs to on-field datasets, we propose data fusion and transfer learning to develop a series of MLHMs to predict the maximum principal strain (MPS) and maximum principal strain rate (MPSR). We trained and tested the MLHMs on 13,623 head impacts from simulations, American football, mixed martial arts, and car crashes, and compared them against models trained on only simulations or only on-field impacts. The MLHMs developed with transfer learning are significantly more accurate in estimating MPS and MPSR than other models, with a mean absolute error (MAE) smaller than 0.03 in predicting MPS and smaller than 7 (1/s) in predicting MPSR on all impact datasets. The MLHMs can be applied to various head impact types for rapidly and accurately calculating brain strain and strain rate. Besides the clinical applications in real-time brain strain and strain rate monitoring, this model helps researchers estimate the brain strain and strain rate caused by head impacts more efficiently than FEM.

APAug 7, 2021

Kinematics clustering enables head impact subtyping for better traumatic brain injury prediction

Xianghao Zhan, Yiheng Li, Yuzhe Liu et al.

Traumatic brain injury can be caused by various types of head impacts. However, due to different kinematic characteristics, many brain injury risk estimation models are not generalizable across the variety of impacts that humans may sustain. The current definitions of head impact subtypes are based on impact sources (e.g., football, traffic accident), which may not reflect the intrinsic kinematic similarities of impacts across the impact sources. To investigate potential new definitions of impact subtypes based on kinematics, 3,161 head impacts from various sources including simulation, college football, mixed martial arts, and car racing were collected. We applied K-means clustering to the impacts using 16 standardized temporal features from head rotation kinematics. Then, we developed subtype-specific ridge regression models for cumulative strain damage (using the threshold of 15%), which significantly improved the estimation accuracy compared with the baseline method, which mixed impacts from different sources and developed one model (R^2 from 0.7 to 0.9). To investigate the effect of kinematic features, we presented the top three critical features (maximum resultant angular acceleration, maximum angular acceleration along the z-axis, maximum linear acceleration along the y-axis) based on regression accuracy and used logistic regression to find the critical points for each feature that partitioned the subtypes. This study enables researchers to define head impact subtypes in a data-driven manner, which leads to more generalizable brain injury risk estimation.
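
The clustering step in this abstract can be sketched with a minimal K-means (Lloyd's algorithm) on z-scored features. The two toy features, their values, and the fixed initial centroids are assumptions for illustration; the study used 16 temporal features, and the subtype-specific ridge regression is not reproduced here.

```python
import math

def standardize(X):
    """Z-score each feature column (population standard deviation)."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in X]

def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(X, init_idx, iters=50):
    """Lloyd's algorithm with centroids initialised from the given row indices."""
    centroids = [list(X[i]) for i in init_idx]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: nearest centroid for every impact.
        labels = [min(range(len(centroids)), key=lambda c: sqdist(x, centroids[c]))
                  for x in X]
        # Update step: mean of the members of each cluster.
        new = []
        for c in range(len(centroids)):
            members = [x for x, l in zip(X, labels) if l == c]
            # Keep the old centroid if a cluster becomes empty.
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels, centroids

# Hypothetical features per impact: [peak angular acceleration, peak angular velocity].
impacts = [[1000, 10], [1200, 12], [1100, 11],
           [5000, 40], [5200, 42], [5100, 41]]
Z = standardize(impacts)
labels, centroids = kmeans(Z, init_idx=[0, 5])
```

Each resulting label would then select which subtype-specific regression model to apply to an impact.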

QMApr 19, 2021

Machine-learning-based head impact subtyping based on the spectral densities of the measurable head kinematics

Xianghao Zhan, Yiheng Li, Yuzhe Liu et al.

Objective: Traumatic brain injury can be caused by head impacts, but many brain injury risk estimation models are not equally accurate across the variety of impacts that patients may undergo and the characteristics of different types of impacts are not well studied. We investigated the spectral characteristics of different head impact types with kinematics classification. Methods: Data was analyzed from 3,262 head impacts from lab reconstruction, American football, mixed martial arts, and publicly available car crash data. A random forest classifier with spectral densities of linear acceleration and angular velocity was built to classify head impact types (e.g., football, car crash, mixed martial arts). To test the classifier robustness, another 271 lab-reconstructed impacts were obtained from 5 other instrumented mouthguards. Finally, with the classifier, type-specific, nearest-neighbor regression models were built for brain strain. Results: The classifier reached a median accuracy of 96% over 1,000 random partitions of training and test sets. The most important features in the classification included both low-frequency and high-frequency features, both linear acceleration features and angular velocity features. Different head impact types had different distributions of spectral densities in low-frequency and high-frequency ranges (e.g., the spectral densities of MMA impacts were higher in high-frequency range than in the low-frequency range). The type-specific regression showed a generally higher R^2-value than baseline models without classification. Conclusion: The machine-learning-based classifier enables a better understanding of the impact kinematics spectral density in different sports, and it can be applied to evaluate the quality of impact-simulation systems and on-field data augmentation.
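
To make the "spectral densities as classifier features" idea concrete, the sketch below computes a periodogram with a direct DFT and sums it over a low and a high frequency band. The sampling rate, band edges, and synthetic sine signals are assumptions for illustration, and the random forest itself is not reproduced here.

```python
import cmath
import math

def power_spectrum(signal, fs):
    """Periodogram at non-negative frequencies via a direct DFT.

    Returns (freqs, power). No one-sided doubling is applied.
    """
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / (fs * n))
    return freqs, power

def band_power(freqs, power, lo, hi):
    """Sum the spectral density over lo <= f < hi as one feature."""
    return sum(p for f, p in zip(freqs, power) if lo <= f < hi)

def spectral_features(signal, fs, split=100.0):
    """Two features: power below and above the (assumed) split frequency."""
    freqs, power = power_spectrum(signal, fs)
    return (band_power(freqs, power, 0.0, split),
            band_power(freqs, power, split, fs / 2))

fs = 1000.0                        # hypothetical sampling rate in Hz
t = [i / fs for i in range(100)]   # 0.1 s window
slow = [math.sin(2 * math.pi * 20 * x) for x in t]    # 20 Hz content
fast = [math.sin(2 * math.pi * 200 * x) for x in t]   # 200 Hz content
slow_low, slow_high = spectral_features(slow, fs)
fast_low, fast_high = spectral_features(fast, fs)
```

Feature vectors of this kind, computed per channel of linear acceleration and angular velocity, are what a classifier could consume to separate impact types.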

BIO-PHFeb 9, 2021

Predictive Factors of Kinematics in Traumatic Brain Injury from Head Impacts Based on Statistical Interpretation

Xianghao Zhan, Yiheng Li, Yuzhe Liu et al.

Brain tissue deformation resulting from head impacts is primarily caused by rotation and can lead to traumatic brain injury. To quantify brain injury risk based on measurements of kinematics on the head, finite element (FE) models and various brain injury criteria based on different factors of these kinematics have been developed, but the contribution of different kinematic factors has not been comprehensively analyzed across different types of head impacts in a data-driven manner. To better design brain injury criteria, the predictive power of rotational kinematics factors, which are different in 1) the derivative order (angular velocity, angular acceleration, angular jerk), 2) the direction and 3) the power (e.g., square-rooted, squared, cubic) of the angular velocity, were analyzed based on different datasets including laboratory impacts, American football, mixed martial arts (MMA), NHTSA automobile crashworthiness tests and NASCAR crash events. Ordinary least squares regressions were built from kinematics factors to the 95% maximum principal strain (MPS95), and we compared zero-order correlation coefficients, structure coefficients, commonality analysis, and dominance analysis. The angular acceleration, the magnitude, and the first power factors showed the highest predictive power for the majority of impacts including laboratory impacts, American football impacts, with few exceptions (angular velocity for MMA and NASCAR impacts). The predictive power of rotational kinematics in three directions (x: posterior-to-anterior, y: left-to-right, z: superior-to-inferior) of kinematics varied with different sports and types of head impacts.

TODec 18, 2020

Relationship between brain injury criteria and brain strain across different types of head impacts can be different

Xianghao Zhan, Yiheng Li, Yuzhe Liu et al.

Multiple brain injury criteria (BIC) have been developed to quickly quantify brain injury risks after head impacts. These BIC, which originated from different types of head impacts (e.g., sports and car crashes), are widely used in risk evaluation. However, the accuracy of using the BIC for brain injury risk estimation across different types of head impacts has not been evaluated. Physiologically, brain strain is often considered the key parameter of brain injury. To evaluate the BIC's risk estimation accuracy across five datasets comprising different head impact types, linear regression was used to model 95% maximum principal strain, 95% maximum principal strain at the corpus callosum, and cumulative strain damage (15%) on each of 18 BIC respectively. The results show a significant difference in the relationship between BIC and brain strain across datasets, indicating that the same BIC value may suggest different brain strain in different head impact types. The accuracy of brain strain regression generally decreases if the BIC regression models are fit on a dataset with a different type of head impact rather than on a dataset with the same type. Given this finding, this study raises concerns about applying BIC to estimate the brain injury risks for head impacts different from the head impacts on which the BIC was developed.
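
The cross-dataset comparison described here can be illustrated with a one-variable least-squares fit evaluated on its own dataset and on a different one. All numbers below are made up for illustration (they are not from the paper), and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def r_squared(a, b, x, y):
    """Coefficient of determination of the line (a, b) on data (x, y)."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical BIC values and brain strain for two impact types.
bic = [1.0, 2.0, 3.0, 4.0, 5.0]
strain_football = [0.05, 0.11, 0.15, 0.21, 0.25]
strain_car_crash = [0.12, 0.14, 0.17, 0.18, 0.20]

a, b = fit_line(bic, strain_football)
r2_same = r_squared(a, b, bic, strain_football)     # fit and test on football
r2_cross = r_squared(a, b, bic, strain_car_crash)   # football fit, car-crash test
```

When the BIC-to-strain relationship differs between impact types, as in this toy data, the model fit on one type scores a much lower R^2 on the other, which mirrors the concern the abstract raises.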

TOOct 16, 2020

Deep Learning Head Model for Real-time Estimation of Entire Brain Deformation in Concussion

Xianghao Zhan, Yuzhe Liu, Samuel J. Raymond et al.

Objective: Many recent studies have suggested that brain deformation resulting from a head impact is linked to the corresponding clinical outcome, such as mild traumatic brain injury (mTBI). Even though several finite element (FE) head models have been developed and validated to calculate brain deformation based on impact kinematics, the clinical application of these FE head models is limited due to the time-consuming nature of FE simulations. This work aims to accelerate the process of brain deformation calculation and thus improve the potential for clinical applications. Methods: We propose a deep learning head model with a five-layer deep neural network and feature engineering, and trained and tested the model on 1803 total head impacts from a combination of head model simulations and on-field college football and mixed martial arts impacts. Results: The proposed deep learning head model can calculate the maximum principal strain for every element in the entire brain in less than 0.001s (with an average root mean squared error of 0.025, and with a standard deviation of 0.002 over twenty repeats with random data partition and model initialization). The contributions of various features to the predictive power of the model were investigated, and it was noted that the features based on angular acceleration were found to be more predictive than the features based on angular velocity. Conclusion: Trained using the dataset of 1803 head impacts, this model can be applied to various sports in the calculation of brain strain with accuracy, and its applicability can even further be extended by incorporating data from other types of head impacts. Significance: In addition to the potential clinical application in real-time brain deformation monitoring, this model will help researchers estimate the brain strain from a large number of head impacts more efficiently than using FE models.

CVNov 14, 2016

3-D Convolutional Neural Networks for Glioblastoma Segmentation

Darvin Yi, Mu Zhou, Zhao Chen et al.

Convolutional Neural Networks (CNN) have emerged as powerful tools for learning discriminative image features. In this paper, we propose a framework of 3-D fully CNN models for Glioblastoma segmentation from multi-modality MRI data. By generalizing CNN models to true 3-D convolutions in learning 3-D tumor MRI data, the proposed approach utilizes a unique network architecture to decouple image pixels. Specifically, we design a convolutional layer with pre-defined Difference-of-Gaussian (DoG) filters to perform true 3-D convolution incorporating local neighborhood information at each pixel. We then use three trained convolutional layers that act to decouple voxels from the initial 3-D convolution. The proposed framework allows identification of high-level tumor structures on MRI. We evaluate segmentation performance on the BRATS segmentation dataset with 274 tumor samples. Extensive experimental results demonstrate encouraging performance of the proposed approach compared to the state-of-the-art methods. Our data-driven approach achieves a median Dice score accuracy of 89% in whole tumor glioblastoma segmentation, revealing a generalized low-bias possibility to learn from medium-size MRI datasets.
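
For readers unfamiliar with Difference-of-Gaussian filters, the sketch below builds a small 3-D DoG kernel as the difference of two normalised Gaussians. The kernel size and the two sigma values are assumptions for illustration; the abstract does not state the filter parameters used in the paper.

```python
import math

def gaussian_kernel_3d(size, sigma):
    """Normalised 3-D Gaussian kernel of shape size x size x size (size odd)."""
    half = size // 2
    k = [[[math.exp(-((x - half) ** 2 + (y - half) ** 2 + (z - half) ** 2)
                    / (2.0 * sigma ** 2))
           for z in range(size)] for y in range(size)] for x in range(size)]
    total = sum(v for plane in k for row in plane for v in row)
    return [[[v / total for v in row] for row in plane] for plane in k]

def dog_kernel_3d(size, sigma_narrow, sigma_wide):
    """Difference of two normalised Gaussians; entries sum to approximately 0."""
    g1 = gaussian_kernel_3d(size, sigma_narrow)
    g2 = gaussian_kernel_3d(size, sigma_wide)
    return [[[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(p1, p2)]
            for p1, p2 in zip(g1, g2)]

# Hypothetical parameters: 5x5x5 kernel, sigmas of 1 and 2 voxels.
kernel = dog_kernel_3d(size=5, sigma_narrow=1.0, sigma_wide=2.0)
kernel_sum = sum(v for plane in kernel for row in plane for v in row)
center = kernel[2][2][2]
corner = kernel[0][0][0]
```

Because the kernel sums to roughly zero, convolving it with a volume suppresses uniform regions and responds to local intensity changes, which is why such fixed filters can serve as a first layer that injects neighborhood information.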