Bart Elen

h-index: 13

Papers: 6

Citations: 1,300

Novelty: 39%

AI Score: 42

Ranked #58,614 of 194,257 authors (top 30%); #13,327 in LG (top 33%)

6 Papers

6.1 | CV | Mar 17 | Code

CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Mahmoud Ibrahim, Bart Elen, Chang Sun et al.

Generative models are increasingly used to augment medical imaging datasets for fairer AI. Yet a key assumption often goes unexamined: that generators themselves produce equally high-quality images across demographic groups. Models trained on imbalanced data can inherit these imbalances, yielding degraded synthesis quality for rare subgroups and struggling with demographic intersections absent from training. We refer to this as the imbalanced generator problem. Existing remedies such as loss reweighting operate at the optimization level and provide limited benefit when training signal is scarce or absent for certain combinations. We propose CompDiff, a hierarchical compositional diffusion framework that addresses this problem at the representation level. A dedicated Hierarchical Conditioner Network (HCN) decomposes demographic conditioning, producing a demographic token concatenated with CLIP embeddings as cross-attention context. This structured factorization encourages parameter sharing across subgroups and supports compositional generalization to rare or unseen demographic intersections. Experiments on chest X-rays (MIMIC-CXR) and fundus images (FairGenMed) show that CompDiff compares favorably against both standard fine-tuning and FairDiffusion across image quality (FID: 64.3 vs. 75.1), subgroup equity (ES-FID), and zero-shot intersectional generalization (up to 21% FID improvement on held-out intersections). Downstream classifiers trained on CompDiff-generated data also show improved AUROC and reduced demographic bias, suggesting that architectural design of demographic conditioning is an important and underexplored factor in fair medical image generation. Code is available at https://anonymous.4open.science/r/CompDiff-6FE6.

7.5 | AI | Jul 16

Demographically-Conditioned Synthetic Medical Images for Bias Mitigation and Bias Detection in Disease Classifiers

Mahmoud Ibrahim, Bart Elen, Chang Sun et al.

Per-subgroup fairness audits of medical image classifiers face a sample-size problem: minority subgroups in held-out test sets have so few samples that the resulting confidence intervals on per-subgroup performance are wider than the bias the audit is meant to detect. We argue that a demographically-conditioned synthetic generator can do both: mitigate bias on the training side and detect bias on the evaluation side. Working on COVID-19 chest CT classification with an end-to-end fine-tuned Stable Diffusion 2.1 generator, we make two findings. For bias mitigation (training), a demographically-balanced synthetic cohort is most useful as a pretraining prior, not as joint augmentation: with the same fixed data, sequential pretraining followed by fine-tuning substantially outperforms joint augmentation, and the resulting classifier surpasses the full-real baseline at $\sim 100\times$ real-data efficiency. For bias detection (evaluation), across five synthetic minority cohorts and five classifier seeds, the synthetic estimator reproduces the subgroup ranking of a well-powered real oracle (Spearman $\rho = 1.00$ on MCC and Recall) and gives the more reliable per-cell estimate where the small real test set runs out of samples. The synthetic cohort is therefore most useful in exactly the cells that fairness audits care about, as both a fix for and a measure of subgroup bias.
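The ranking agreement reported above is a Spearman rank correlation between per-subgroup scores from two estimators. The following is a minimal sketch of that statistic in plain Python. The subgroup scores are made-up illustrative numbers, not values from the paper, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-subgroup MCC values (illustrative only).
real_oracle = [0.81, 0.74, 0.69, 0.77, 0.72]
synthetic_estimate = [0.79, 0.70, 0.66, 0.75, 0.68]
rho = spearman_rho(real_oracle, synthetic_estimate)
```

A value of 1 means the two estimators order the subgroups identically, even if the absolute scores differ.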

4.1 | LG | Oct 22, 2025

Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series

Mahmoud Ibrahim, Bart Elen, Chang Sun et al.

We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce Enhanced TimeAutoDiff, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation ("TRTS gap") by over 70%, achieving $\Delta_{\mathrm{TRTS}} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{\mathrm{TSTR}} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72-84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
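To make the TRTS gap concrete, here is a minimal sketch that assumes the gap is the absolute difference between the AUROC of a real-trained model scored on a synthetic test set and the same model scored on a real test set. That definition, the function names, and the toy labels and scores are all assumptions for illustration; the paper's exact protocol may differ.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def trts_gap(real_labels, real_scores, synth_labels, synth_scores):
    """Assumed definition: |AUROC on synthetic test - AUROC on real test|."""
    return abs(auroc(synth_labels, synth_scores) - auroc(real_labels, real_scores))


# Toy predictions from one hypothetical real-trained model.
real_labels, real_scores = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]
synth_labels, synth_scores = [0, 0, 1, 1], [0.2, 0.3, 0.6, 0.9]
gap = trts_gap(real_labels, real_scores, synth_labels, synth_scores)
```

A small gap suggests that evaluating on the synthetic cohort gives roughly the same AUROC as evaluating on real data, which is the property the abstract relies on for subgroup-level evaluation.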

21.1 | LG | Jun 27, 2024

Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges

Mahmoud Ibrahim, Yasmina Al Khalil, Sina Amirrajab et al.

This paper presents a comprehensive systematic review of generative models (GANs, VAEs, DMs, and LLMs) used to synthesize various medical data types, including imaging (dermoscopic, mammographic, ultrasound, CT, MRI, and X-ray), text, time-series, and tabular data (EHR). Unlike previous narrowly focused reviews, our study encompasses a broad array of medical data modalities and explores various generative models. Our search strategy queries databases such as Scopus, PubMed, and ArXiv, focusing on recent works from January 2021 to November 2023, excluding reviews and perspectives. This period emphasizes recent advancements beyond GANs, which have been extensively covered previously. The survey reveals insights from three key aspects: (1) Synthesis applications and purpose of synthesis, (2) generation techniques, and (3) evaluation methods. It highlights clinically valid synthesis applications, demonstrating the potential of synthetic data to tackle diverse clinical requirements. While conditional models incorporating class labels, segmentation masks and image translations are prevalent, there is a gap in utilizing prior clinical knowledge and patient-specific context, suggesting a need for more personalized synthesis approaches and emphasizing the importance of tailoring generative approaches to the unique characteristics of medical data. Additionally, there is a significant gap in using synthetic data beyond augmentation, such as for validation and evaluation of downstream medical AI models. The survey uncovers that the lack of standardized evaluation methodologies tailored to medical images is a barrier to clinical application, underscoring the need for in-depth evaluation approaches, benchmarking, and comparative studies to promote openness and collaboration.

2.4 | IV | Jun 7, 2021

Pointwise visual field estimation from optical coherence tomography in glaucoma: a structure-function analysis using deep learning

Ruben Hemelings, Bart Elen, João Barbosa Breda et al.

Background/Aims: Standard Automated Perimetry (SAP) is the gold standard to monitor visual field (VF) loss in glaucoma management, but is prone to intra-subject variability. We developed and validated a deep learning (DL) regression model that estimates pointwise and overall VF loss from unsegmented optical coherence tomography (OCT) scans. Methods: Eight DL regression models were trained with various retinal imaging modalities: circumpapillary OCT at 3.5mm, 4.1mm, 4.7mm diameter, and scanning laser ophthalmoscopy (SLO) en face images to estimate mean deviation (MD) and 52 threshold values. This retrospective study used data from patients who underwent a complete glaucoma examination, including a reliable Humphrey Field Analyzer (HFA) 24-2 SITA Standard VF exam and a SPECTRALIS OCT scan using the Glaucoma Module Premium Edition. Results: A total of 1378 matched OCT-VF pairs of 496 patients (863 eyes) were included for training and evaluation of the DL models. Average sample MD was -7.53dB (from -33.8dB to +2.0dB). For 52 VF threshold values estimation, the circumpapillary OCT scan with the largest radius (4.7mm) achieved the best performance among all individual models (Pearson r=0.77, 95% CI=[0.72-0.82]). For MD, prediction averaging of OCT-trained models (3.5mm, 4.1mm, 4.7mm) resulted in a Pearson r of 0.78 [0.73-0.83] on the validation set and comparable performance on the test set (Pearson r=0.79 [0.75-0.82]). Conclusion: DL on unsegmented OCT scans accurately predicts pointwise and mean deviation of 24-2 VF in glaucoma patients. Automated VF from unsegmented OCT could be a solution for patients unable to produce reliable perimetry results.

5.2 | IV | Jun 4, 2020

Pathological myopia classification with simultaneous lesion segmentation using deep learning

Ruben Hemelings, Bart Elen, Matthew B. Blaschko et al.

This investigation reports on the results of convolutional neural networks developed for the recently introduced PathologicAL Myopia (PALM) dataset, which consists of 1200 fundus images. We propose a new Optic Nerve Head (ONH)-based prediction enhancement for the segmentation of atrophy and fovea. Models trained with 400 available training images achieved an AUC of 0.9867 for pathological myopia classification, and a Euclidean distance of 58.27 pixels on the fovea localization task, evaluated on a test set of 400 images. Dice and F1 metrics for semantic segmentation of lesions scored 0.9303 and 0.9869 on optic disc, 0.8001 and 0.9135 on retinal atrophy, and 0.8073 and 0.7059 on retinal detachment, respectively. Our work was acknowledged with an award in the context of the "PathologicAL Myopia detection from retinal images" challenge held during the IEEE International Symposium on Biomedical Imaging (April 2019). Considering that (pathological) myopia cases are often identified as false positives and negatives in classification systems for glaucoma, we envision that the current work could aid in future research to discriminate between glaucomatous and highly-myopic eyes, complemented by the localization and segmentation of landmarks such as fovea, optic disc and atrophy.
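For readers unfamiliar with the segmentation and localization metrics quoted above, the following is a minimal sketch of the Dice coefficient on binary masks and of the Euclidean pixel distance between a predicted and a reference landmark. The masks, coordinates, and function names are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption; the challenge's official evaluation code may differ.

```python
import math


def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for binary masks given as nested lists."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in true if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def euclidean_distance(pred_xy, true_xy):
    """Straight-line distance in pixels between two (x, y) points."""
    return math.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])


# Hypothetical 2x3 lesion masks and fovea coordinates.
pred_mask = [[0, 1, 1], [0, 1, 0]]
true_mask = [[0, 1, 0], [0, 1, 1]]
dice = dice_coefficient(pred_mask, true_mask)
fovea_error = euclidean_distance((103, 204), (100, 200))
```

Dice rewards overlap relative to the combined mask size, so it is less dominated by the large background region than plain pixel accuracy.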