CVJan 17, 2023Code
SegViz: A federated-learning based framework for multi-organ segmentation on heterogeneous data sets with partial annotationsAdway U. Kanhere, Pranav Kulkarni, Paul H. Yi et al.
Segmentation is one of the most primary tasks in deep learning for medical imaging, owing to its multiple downstream clinical applications. However, generating manual annotations for medical images is time-consuming, requires high skill, and is an expensive effort, especially for 3D images. One potential solution is to aggregate knowledge from partially annotated datasets from multiple groups to collaboratively train global models using Federated Learning. To this end, we propose SegViz, a federated learning-based framework to train a segmentation model from distributed non-i.i.d datasets with partial annotations. The performance of SegViz was compared against training individual models separately on each dataset as well as centrally aggregating all the datasets in one place and training a single model. The SegViz framework using FedBN as the aggregation strategy demonstrated excellent performance on the external BTCV set with dice scores of 0.93, 0.83, 0.55, and 0.75 for segmentation of liver, spleen, pancreas, and kidneys, respectively, significantly ($p<0.05$) better (except spleen) than the dice scores of 0.87, 0.83, 0.42, and 0.48 for the baseline models. In contrast, the central aggregation model significantly ($p<0.05$) performed poorly on the test dataset with dice scores of 0.65, 0, 0.55, and 0.68. Our results demonstrate the potential of the SegViz framework to train multi-task models from distributed datasets with partial labels. All our implementations are open-source and available at https://anonymous.4open.science/r/SegViz-B746
CVNov 29, 2022
Weakly Supervised Learning Significantly Reduces the Number of Labels Required for Intracranial Hemorrhage Detection on Head CTJacopo Teneggi, Paul H. Yi, Jeremias Sulam
Modern machine learning pipelines, in particular those based on deep learning (DL) models, require large amounts of labeled data. For classification problems, the most common learning paradigm consists of presenting labeled examples during training, thus providing strong supervision on what constitutes positive and negative samples. This constitutes a major obstacle for the development of DL models in radiology--in particular for cross-sectional imaging (e.g., computed tomography [CT] scans)--where labels must come from manual annotations by expert radiologists at the image or slice-level. These differ from examination-level annotations, which are coarser but cheaper, and could be extracted from radiology reports using natural language processing techniques. This work studies the question of what kind of labels should be collected for the problem of intracranial hemorrhage detection in brain CT. We investigate whether image-level annotations should be preferred to examination-level ones. By framing this task as a multiple instance learning problem, and employing modern attention-based DL architectures, we analyze the degree to which different levels of supervision improve detection performance. We find that strong supervision (i.e., learning with local image-level annotations) and weak supervision (i.e., learning with only global examination-level labels) achieve comparable performance in examination-level hemorrhage detection (the task of selecting the images in an examination that show signs of hemorrhage) as well as in image-level hemorrhage detection (highlighting those signs within the selected images). Furthermore, we study this behavior as a function of the number of labels available during training. Our results suggest that local labels may not be necessary at all for these tasks, drastically reducing the time and cost involved in collecting and curating datasets.
LGMar 10, 2023
Optimizing Federated Learning for Medical Image Classification on Distributed Non-iid Datasets with Partial LabelsPranav Kulkarni, Adway Kanhere, Paul H. Yi et al.
Numerous large-scale chest x-ray datasets have spearheaded expert-level detection of abnormalities using deep learning. However, these datasets focus on detecting a subset of disease labels that could be present, thus making them distributed and non-iid with partial labels. Recent literature has indicated the impact of batch normalization layers on the convergence of federated learning due to domain shift associated with non-iid data with partial labels. To that end, we propose FedFBN, a federated learning framework that draws inspiration from transfer learning by using pretrained networks as the model backend and freezing the batch normalization layers throughout the training process. We evaluate FedFBN with current FL strategies using synthetic iid toy datasets and large-scale non-iid datasets across scenarios with partial and complete labels. Our results demonstrate that FedFBN outperforms current aggregation strategies for training global models using distributed and non-iid data with partial labels.
IVNov 11, 2022
From Competition to Collaboration: Making Toy Datasets on Kaggle Clinically Useful for Chest X-Ray Diagnosis Using Federated LearningPranav Kulkarni, Adway Kanhere, Paul H. Yi et al.
Chest X-ray (CXR) datasets hosted on Kaggle, though useful from a data science competition standpoint, have limited utility in clinical use because of their narrow focus on diagnosing one specific disease. In real-world clinical use, multiple diseases need to be considered since they can co-exist in the same patient. In this work, we demonstrate how federated learning (FL) can be used to make these toy CXR datasets from Kaggle clinically useful. Specifically, we train a single FL classification model (`global`) using two separate CXR datasets -- one annotated for presence of pneumonia and the other for presence of pneumothorax (two common and life-threatening conditions) -- capable of diagnosing both. We compare the performance of the global FL model with models trained separately on both datasets (`baseline`) for two different model architectures. On a standard, naive 3-layer CNN architecture, the global FL model achieved AUROC of 0.84 and 0.81 for pneumonia and pneumothorax, respectively, compared to 0.85 and 0.82, respectively, for both baseline models (p>0.05). Similarly, on a pretrained DenseNet121 architecture, the global FL model achieved AUROC of 0.88 and 0.91 for pneumonia and pneumothorax, respectively, compared to 0.89 and 0.91, respectively, for both baseline models (p>0.05). Our results suggest that FL can be used to create global `meta` models to make toy datasets from Kaggle clinically useful, a step forward towards bridging the gap from bench to bedside.
33.6CLMar 13
MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician PreferencesEric Wu, Kevin Wu, Jason Hom et al.
Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
CVJan 17, 2023
From Isolation to Collaboration: Federated Class-Heterogeneous Learning for Chest X-Ray ClassificationPranav Kulkarni, Adway Kanhere, Paul H. Yi et al.
Federated learning (FL) is a promising paradigm to collaboratively train a global chest x-ray (CXR) classification model using distributed datasets while preserving patient privacy. A significant, yet relatively underexplored, challenge in FL is class-heterogeneity, where clients have different sets of classes. We propose surgical aggregation, a FL method that uses selective aggregation to collaboratively train a global model using distributed, class-heterogeneous datasets. Unlike other methods, our method does not rely on the assumption that clients share the same classes as other clients, know the classes of other clients, or have access to a fully annotated dataset. We evaluate surgical aggregation using class-heterogeneous CXR datasets across IID and non-IID settings. Our results show that our method outperforms current methods and has better generalizability.
CVJul 1, 2023
Towards Resource-Efficient Streaming of Large-Scale Medical Image Datasets for Deep LearningPranav Kulkarni, Adway Kanhere, Eliot Siegel et al.
Large-scale medical imaging datasets have accelerated deep learning (DL) for medical image analysis. However, the large scale of these datasets poses a challenge for researchers, resulting in increased storage and bandwidth requirements for hosting and accessing them. Since different researchers have different use cases and require different resolutions or formats for DL, it is neither feasible to anticipate every researcher's needs nor practical to store data in multiple resolutions and formats. To that end, we propose the Medical Image Streaming Toolkit (MIST), a format-agnostic database that enables streaming of medical images at different resolutions and formats from a single high-resolution copy. We evaluated MIST across eight popular, large-scale medical imaging datasets spanning different body parts, modalities, and formats. Our results showed that our framework reduced the storage and bandwidth requirements for hosting and downloading datasets without impacting image quality. We demonstrate that MIST addresses the challenges posed by large-scale medical imaging datasets by building a data-efficient and format-agnostic database to meet the diverse needs of researchers and reduce barriers to DL research in medical imaging.
CVMar 22, 2024
Anytime, Anywhere, Anyone: Investigating the Feasibility of Segment Anything Model for Crowd-Sourcing Medical Image AnnotationsPranav Kulkarni, Adway Kanhere, Dharmam Savani et al.
Curating annotations for medical image segmentation is a labor-intensive and time-consuming task that requires domain expertise, resulting in "narrowly" focused deep learning (DL) models with limited translational utility. Recently, foundation models like the Segment Anything Model (SAM) have revolutionized semantic segmentation with exceptional zero-shot generalizability across various domains, including medical imaging, and hold a lot of promise for streamlining the annotation process. However, SAM has yet to be evaluated in a crowd-sourced setting to curate annotations for training 3D DL segmentation models. In this work, we explore the potential of SAM for crowd-sourcing "sparse" annotations from non-experts to generate "dense" segmentation masks for training 3D nnU-Net models, a state-of-the-art DL segmentation model. Our results indicate that while SAM-generated annotations exhibit high mean Dice scores compared to ground-truth annotations, nnU-Net models trained on SAM-generated annotations perform significantly worse than nnU-Net models trained on ground-truth annotations ($p<0.001$, all).
MLMar 4, 2025
Multiaccuracy and Multicalibration via Proxy GroupsBeepul Bharti, Mary Versa Clemens-Sewall, Paul H. Yi et al.
As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. However, measuring and enforcing fairness in real-world applications can be challenging due to the missing or incomplete sensitive group information. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. Knowing how to evaluate and control for fairness with missing sensitive group data for newer, different, and more flexible frameworks, such as multiaccuracy and multicalibration, remain unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably used to derive actionable upper bounds on the true multiaccuracy and multicalibration violations, providing insights into a predictive model's potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, sensitive groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group data is incomplete or unavailable.
AIFeb 12, 2024
Out-of-Distribution Detection and Data Drift Monitoring using Statistical Process ControlGhada Zamzmi, Kesavan Venkatesh, Brandon Nelson et al.
Background: Machine learning (ML) methods often fail with data that deviates from their training distribution. This is a significant concern for ML-enabled devices in clinical settings, where data drift may cause unexpected performance that jeopardizes patient safety. Method: We propose a ML-enabled Statistical Process Control (SPC) framework for out-of-distribution (OOD) detection and drift monitoring. SPC is advantageous as it visually and statistically highlights deviations from the expected distribution. To demonstrate the utility of the proposed framework for monitoring data drift in radiological images, we investigated different design choices, including methods for extracting feature representations, drift quantification, and SPC parameter selection. Results: We demonstrate the effectiveness of our framework for two tasks: 1) differentiating axial vs. non-axial computed tomography (CT) images and 2) separating chest x-ray (CXR) from other modalities. For both tasks, we achieved high accuracy in detecting OOD inputs, with 0.913 in CT and 0.995 in CXR, and sensitivity of 0.980 in CT and 0.984 in CXR. Our framework was also adept at monitoring data streams and identifying the time a drift occurred. In a simulation with 100 daily CXR cases, we detected a drift in OOD input percentage from 0-1% to 3-5% within two days, maintaining a low false-positive rate. Through additional experimental results, we demonstrate the framework's data-agnostic nature and independence from the underlying model's structure. Conclusion: We propose a framework for OOD detection and drift monitoring that is agnostic to data, modality, and model. The framework is customizable and can be adapted for specific applications.
LGDec 5, 2025
When Privacy Isn't Synthetic: Hidden Data Leakage in Generative AI ModelsS. M. Mustaqim, Anantaa Kotal, Paul H. Yi
Generative models are increasingly used to produce privacy-preserving synthetic data as a safe alternative to sharing sensitive training datasets. However, we demonstrate that such synthetic releases can still leak information about the underlying training samples through structural overlap in the data manifold. We propose a black-box membership inference attack that exploits this vulnerability without requiring access to model internals or real data. The attacker repeatedly queries the generative model to obtain large numbers of synthetic samples, performs unsupervised clustering to identify dense regions of the synthetic distribution, and then analyzes cluster medoids and neighborhoods that correspond to high-density regions in the original training data. These neighborhoods act as proxies for training samples, enabling the adversary to infer membership or reconstruct approximate records. Our experiments across healthcare, finance, and other sensitive domains show that cluster overlap between real and synthetic data leads to measurable membership leakage-even when the generator is trained with differential privacy or other noise mechanisms. The results highlight an under-explored attack surface in synthetic data generation pipelines and call for stronger privacy guarantees that account for distributional neighborhood inference rather than sample-level memorization alone, underscoring its role in privacy-preserving data publishing. Implementation and evaluation code are publicly available at:github.com/Cluster-Medoid-Leakage-Attack.
LGMay 21, 2025
Direct Preference Optimization for Adaptive Concept-based ExplanationsJacopo Teneggi, Zhenzhen Wang, Paul H. Yi et al.
Concept-based explanation methods aim at making machine learning models more transparent by finding the most important semantic features of an input (e.g., colors, patterns, shapes) for a given prediction task. However, these methods generally ignore the communicative context of explanations, such as the preferences of a listener. For example, medical doctors understand explanations in terms of clinical markers, but patients may not, needing a different vocabulary to rationalize the same diagnosis. We address this gap with listener-adaptive explanations grounded in principles of pragmatic reasoning and the rational speech act. We introduce an iterative training procedure based on direct preference optimization where a speaker learns to compose explanations that maximize communicative utility for a listener. Our approach only needs access to pairwise preferences, which can be collected from human feedback, making it particularly relevant in real-world scenarios where a model of the listener may not be available. We demonstrate that our method is able to align speakers with the preferences of simulated listeners on image classification across three datasets, and further validate that pragmatic explanations generated with our method improve the classification accuracy of participants in a user study.
CVFeb 13, 2025
Towards Virtual Clinical Trials of Radiology AI with Conditional Generative ModelingBenjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni et al.
Artificial intelligence (AI) is poised to transform healthcare by enabling personalized and efficient care through data-driven insights. Although radiology is at the forefront of AI adoption, in practice, the potential of AI models is often overshadowed by severe failures to generalize: AI models can have performance degradation of up to 20% when transitioning from controlled test environments to clinical use by radiologists. This mismatch raises concerns that radiologists will be misled by incorrect AI predictions in practice and/or grow to distrust AI, rendering these promising technologies practically ineffectual. Exhaustive clinical trials of AI models on abundant and diverse data is thus critical to anticipate AI model degradation when encountering varied data samples. Achieving these goals, however, is challenging due to the high costs of collecting diverse data samples and corresponding annotations. To overcome these limitations, we introduce a novel conditional generative AI model designed for virtual clinical trials (VCTs) of radiology AI, capable of realistically synthesizing full-body CT images of patients with specified attributes. By learning the joint distribution of images and anatomical structures, our model enables precise replication of real-world patient populations with unprecedented detail at this scale. We demonstrate meaningful evaluation of radiology AI models through VCTs powered by our synthetic CT study populations, revealing model degradation and facilitating algorithmic auditing for bias-inducing data attributes. Our generative AI approach to VCTs is a promising avenue towards a scalable solution to assess model robustness, mitigate biases, and safeguard patient care by enabling simpler testing and evaluation of AI models in any desired range of diverse patient populations.
CVApr 30, 2024
Expanding the Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray ClassificationSkylar Chan, Pranav Kulkarni, Paul H. Yi et al.
Quantum machine learning (QML) has the potential for improving the multi-label classification of rare, albeit critical, diseases in large-scale chest x-ray (CXR) datasets due to theoretical quantum advantages over classical machine learning (CML) in sample efficiency and generalizability. While prior literature has explored QML with CXRs, it has focused on binary classification tasks with small datasets due to limited access to quantum hardware and computationally expensive simulations. To that end, we implemented a Jax-based framework that enables the simulation of medium-sized qubit architectures with significant improvements in wall-clock time over current software offerings. We evaluated the performance of our Jax-based framework in terms of efficiency and performance for hybrid quantum transfer learning for long-tailed classification across 8, 14, and 19 disease labels using large-scale CXR datasets. The Jax-based framework resulted in up to a 58% and 95% speed-up compared to PyTorch and TensorFlow implementations, respectively. However, compared to CML, QML demonstrated slower convergence and an average AUROC of 0.70, 0.73, and 0.74 for the classification of 8, 14, and 19 CXR disease labels. In comparison, the CML models had an average AUROC of 0.77, 0.78, and 0.80 respectively. In conclusion, our work presents an accessible implementation of hybrid quantum transfer learning for long-tailed CXR classification with a computationally efficient Jax-based framework.
IVApr 10, 2024
Improving Multi-Center Generalizability of GAN-Based Fat Suppression using Federated LearningPranav Kulkarni, Adway Kanhere, Harshita Kukreja et al.
Generative Adversarial Network (GAN)-based synthesis of fat suppressed (FS) MRIs from non-FS proton density sequences has the potential to accelerate acquisition of knee MRIs. However, GANs trained on single-site data have poor generalizability to external data. We show that federated learning can improve multi-center generalizability of GANs for synthesizing FS MRIs, while facilitating privacy-preserving multi-institutional collaborations.
LGFeb 8, 2024
Hidden in Plain Sight: Undetectable Adversarial Bias Attacks on Vulnerable Patient PopulationsPranav Kulkarni, Andrew Chan, Nithya Navarathna et al.
The proliferation of artificial intelligence (AI) in radiology has shed light on the risk of deep learning (DL) models exacerbating clinical biases towards vulnerable patient populations. While prior literature has focused on quantifying biases exhibited by trained DL models, demographically targeted adversarial bias attacks on DL models and its implication in the clinical environment remains an underexplored field of research in medical imaging. In this work, we demonstrate that demographically targeted label poisoning attacks can introduce undetectable underdiagnosis bias in DL models. Our results across multiple performance metrics and demographic groups like sex, age, and their intersectional subgroups show that adversarial bias attacks demonstrate high-selectivity for bias in the targeted group by degrading group model performance without impacting overall model performance. Furthermore, our results indicate that adversarial bias attacks result in biased DL models that propagate prediction bias even when evaluated with external datasets.
IVMay 24, 2023
ISLE: An Intelligent Streaming Framework for High-Throughput AI Inference in Medical ImagingPranav Kulkarni, Sean Garin, Adway Kanhere et al.
As the adoption of Artificial Intelligence (AI) systems within the clinical environment grows, limitations in bandwidth and compute can create communication bottlenecks when streaming imaging data, leading to delays in patient care and increased cost. As such, healthcare providers and AI vendors will require greater computational infrastructure, therefore dramatically increasing costs. To that end, we developed ISLE, an intelligent streaming framework for high-throughput, compute- and bandwidth- optimized, and cost effective AI inference for clinical decision making at scale. In our experiments, ISLE on average reduced data transmission by 98.02% and decoding time by 98.09%, while increasing throughput by 2,730%. We show that ISLE results in faster turnaround times, and reduced overall cost of data, transmission, and compute, without negatively impacting clinical decision making using AI systems.
LGMay 12, 2023
Text2Cohort: Facilitating Intuitive Access to Biomedical Data with Natural Language Cohort DiscoveryPranav Kulkarni, Adway Kanhere, Paul H. Yi et al.
The Imaging Data Commons (IDC) is a cloud-based database that provides researchers with open access to cancer imaging data, with the goal of facilitating collaboration. However, cohort discovery within the IDC database has a significant technical learning curve. Recently, large language models (LLM) have demonstrated exceptional utility for natural language processing tasks. We developed Text2Cohort, a LLM-powered toolkit to facilitate user-friendly natural language cohort discovery in the IDC. Our method translates user input into IDC queries using grounding techniques and returns the query's response. We evaluate Text2Cohort on 50 natural language inputs, from information extraction to cohort discovery. Our toolkit successfully generated responses with an 88% accuracy and 0.94 F1 score. We demonstrate that Text2Cohort can enable researchers to discover and curate cohorts on IDC with high levels of accuracy using natural language in a more intuitive and user-friendly way.