Prateek Prasanna

CV
h-index: 52
Papers: 52
Citations: 3,192
Novelty: 53%
AI Score: 61

52 Papers

CV · Jul 20, 2022 · Code
Learning Topological Interactions for Multi-Class Medical Image Segmentation

Saumya Gupta, Xiaoling Hu, James Kaan et al.

Deep learning methods have achieved impressive performance for multi-class medical image segmentation. However, they are limited in their ability to encode topological interactions among different classes (e.g., containment and exclusion). These constraints naturally arise in biomedical images and can be crucial in improving segmentation quality. In this paper, we introduce a novel topological interaction module to encode the topological interactions into a deep neural network. The implementation is completely convolution-based and thus can be very efficient. This empowers us to incorporate the constraints into end-to-end training and enrich the feature representation of neural networks. The efficacy of the proposed method is validated on different types of interactions. We also demonstrate the generalizability of the method on both proprietary and public challenge datasets, in both 2D and 3D settings, as well as across different modalities such as CT and Ultrasound. Code is available at: https://github.com/TopoXLab/TopoInteraction
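The exclusion constraint mentioned in this abstract can be illustrated with a small neighborhood check. This is only a sketch and not the paper's module: it assumes a 2D label map stored as nested lists, and the helper name `exclusion_violations` is hypothetical. It flags pixels of class `a` that touch class `b` in an 8-connected neighborhood, which is the kind of local test a fixed convolution kernel could also compute.

```python
from typing import List, Set, Tuple


def exclusion_violations(labels: List[List[int]], a: int, b: int) -> Set[Tuple[int, int]]:
    """Return (row, col) of every pixel labelled `a` that touches a pixel labelled `b`."""
    h, w = len(labels), len(labels[0])
    bad = set()
    for r in range(h):
        for c in range(w):
            if labels[r][c] != a:
                continue
            # Scan the 3x3 window around (r, c), skipping out-of-bounds positions.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] == b:
                        bad.add((r, c))
    return bad


# Toy label map: class 1 and class 2 are adjacent, so exclusion is violated.
label_map = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 1, 0, 0],
]
violations = exclusion_violations(label_map, a=1, b=2)
```

In a training setting, such flagged pixels could be given extra weight in the loss; how the paper uses them is not stated in the abstract.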

CV · Mar 21, 2023 · Code
Prompt-MIL: Boosting Multi-Instance Learning Schemes via Task-specific Prompt Tuning

Jingwei Zhang, Saarthak Kapse, Ke Ma et al.

Whole slide image (WSI) classification is a critical task in computational pathology, requiring the processing of gigapixel-sized images, which is challenging for current deep-learning methods. Current state-of-the-art methods are based on multi-instance learning (MIL) schemes, which usually rely on pretrained features to represent the instances. Due to the lack of task-specific annotated data, these features are either obtained from well-established backbones on natural images, or, more recently, from self-supervised models pretrained on histopathology. However, both approaches yield task-agnostic features, resulting in performance loss compared to the appropriate task-related supervision, if available. In this paper, we show that when task-specific annotations are limited, we can inject such supervision into downstream task training, to reduce the gap between fully task-tuned and task-agnostic features. We propose Prompt-MIL, an MIL framework that integrates prompts into WSI classification. Prompt-MIL adopts a prompt tuning mechanism, where only a small fraction of parameters calibrates the pretrained features to encode task-specific information, rather than the conventional full fine-tuning approaches. Extensive experiments on three WSI datasets, TCGA-BRCA, TCGA-CRC, and BRIGHT, demonstrate the superiority of Prompt-MIL over conventional MIL methods, achieving a relative improvement of 1.49%-4.03% in accuracy and 0.25%-8.97% in AUROC while using fewer than 0.3% additional parameters. Compared to conventional full fine-tuning approaches, we fine-tune less than 1.3% of the parameters, yet achieve a relative improvement of 1.29%-13.61% in accuracy and 3.22%-27.18% in AUROC and reduce GPU memory consumption by 38%-45% while training 21%-27% faster. Our code is available at https://github.com/cvlab-stonybrook/PromptMIL.

IV · Mar 10, 2022 · Code
Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation

Lei Zhou, Huidong Liu, Joseph Bae et al.

Masked Autoencoder (MAE) has recently been shown to be effective in pre-training Vision Transformers (ViT) for natural image analysis. By reconstructing full images from partially masked inputs, a ViT encoder aggregates contextual information to infer masked image regions. We believe that this context aggregation ability is particularly essential to the medical image domain where each anatomical structure is functionally and mechanically connected to other structures and regions. Because there is no ImageNet-scale medical image dataset for pre-training, we investigate a self pre-training paradigm with MAE for medical image analysis tasks. Our method pre-trains a ViT on the training set of the target data instead of another dataset. Thus, self pre-training can benefit more scenarios where pre-training data is hard to acquire. Our experimental results show that MAE self pre-training markedly improves diverse medical image tasks including chest X-ray disease classification, abdominal CT multi-organ segmentation, and MRI brain tumor segmentation. Code is available at https://github.com/cvlab-stonybrook/SelfMedMAE
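The masking step that MAE-style pre-training relies on can be sketched in a few lines. The helper name `random_patch_mask`, the 196-patch grid, and the 0.75 mask ratio are illustrative assumptions, not values taken from this abstract. The encoder would see only the visible patches, and the decoder would be asked to reconstruct the masked ones.

```python
import random
from typing import List, Tuple


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)  # local generator so the split is reproducible
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 14 x 14 patch grid (196 patches) with 75% of the patches hidden.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75)
```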

LG · Apr 22, 2022
Federated Learning Enables Big Data for Rare Cancer Boundary Detection

Sarthak Pati, Ujjwal Baid, Brandon Edwards et al.

Although machine learning (ML) has shown promise in numerous domains, there are concerns about generalizability to out-of-sample data. This is currently addressed by centrally sharing ample, and importantly diverse, data from multiple sites. However, such centralization is challenging to scale (or even not feasible) due to various limitations. Federated ML (FL) provides an alternative to train accurate and generalizable ML models, by only sharing numerical model updates. Here we present findings from the largest FL study to-date, involving data from 71 healthcare institutions across 6 continents, to generate an automatic tumor boundary detector for the rare disease of glioblastoma, utilizing the largest dataset of such patients ever used in the literature (25,256 MRI scans from 6,314 patients). We demonstrate a 33% improvement over a publicly trained model to delineate the surgically targetable tumor, and 23% improvement over the tumor's entire extent. We anticipate our study to: 1) enable more studies in healthcare informed by large and diverse data, ensuring meaningful results for rare diseases and underrepresented populations, 2) facilitate further quantitative analyses for glioblastoma via performance optimization of our consensus model for eventual public release, and 3) demonstrate the effectiveness of FL at such scale and task complexity as a paradigm shift for multi-site collaborations, alleviating the need for data sharing.
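The idea of sharing only numerical model updates can be illustrated with a sample-weighted average of per-site parameter vectors. The abstract does not say which aggregation rule the study used, so the FedAvg-style weighting below is an assumption, and `federated_average` is a hypothetical helper name.

```python
from typing import List, Sequence


def federated_average(updates: Sequence[Sequence[float]], num_samples: Sequence[int]) -> List[float]:
    """Average per-site parameter vectors, weighting each site by its sample count."""
    total = sum(num_samples)
    dim = len(updates[0])
    return [
        sum(update[i] * n for update, n in zip(updates, num_samples)) / total
        for i in range(dim)
    ]


# Two sites: the second holds three times as much data, so it dominates the average.
site_updates = [[1.0, 2.0], [3.0, 4.0]]
site_sizes = [1, 3]
global_update = federated_average(site_updates, site_sizes)
```

Only the vectors in `site_updates` leave each institution; the underlying scans stay local.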

CV · Jun 9, 2023 · Code
Topology-Aware Uncertainty for Image Segmentation

Saumya Gupta, Yikai Zhang, Xiaoling Hu et al.

Segmentation of curvilinear structures such as vasculature and road networks is challenging due to relatively weak signals and complex geometry/topology. To facilitate and accelerate large scale annotation, one has to adopt semi-automatic approaches such as proofreading by experts. In this work, we focus on uncertainty estimation for such tasks, so that highly uncertain, and thus error-prone structures can be identified for human annotators to verify. Unlike most existing works, which provide pixel-wise uncertainty maps, we stipulate it is crucial to estimate uncertainty in the units of topological structures, e.g., small pieces of connections and branches. To achieve this, we leverage tools from topological data analysis, specifically discrete Morse theory (DMT), to first capture the structures, and then reason about their uncertainties. To model the uncertainty, we (1) propose a joint prediction model that estimates the uncertainty of a structure while taking the neighboring structures into consideration (inter-structural uncertainty); (2) propose a novel Probabilistic DMT to model the inherent uncertainty within each structure (intra-structural uncertainty) by sampling its representations via a perturb-and-walk scheme. On various 2D and 3D datasets, our method produces better structure-wise uncertainty maps compared to existing works. Code available at https://github.com/Saumya-Gupta-26/struct-uncertainty

CV · Dec 23, 2022 · Code
Precise Location Matching Improves Dense Contrastive Learning in Digital Pathology

Jingwei Zhang, Saarthak Kapse, Ke Ma et al.

Dense prediction tasks such as segmentation and detection of pathological entities hold crucial clinical value in computational pathology workflows. However, obtaining dense annotations on large cohorts is usually tedious and expensive. Contrastive learning (CL) is thus often employed to leverage large volumes of unlabeled data to pre-train the backbone network. To boost CL for dense prediction, some studies have proposed variations of dense matching objectives in pre-training. However, our analysis shows that employing existing dense matching strategies on histopathology images enforces invariance among incorrect pairs of dense features and, thus, is imprecise. To address this, we propose a precise location-based matching mechanism that utilizes the overlapping information between geometric transformations to precisely match regions in two augmentations. Extensive experiments on two pretraining datasets (TCGA-BRCA, NCT-CRC-HE) and three downstream datasets (GlaS, CRAG, BCSS) highlight the superiority of our method in semantic and instance segmentation tasks. Our method outperforms previous dense matching methods by up to 7.2% in average precision for detection and 5.6% in average precision for instance segmentation tasks. Additionally, by using our matching mechanism in the three popular contrastive learning frameworks, MoCo-v2, VICRegL, and ConCL, the average precision in detection is improved by 0.7% to 5.2%, and the average precision in segmentation is improved by 0.7% to 4.0%, demonstrating generalizability. Our code is available at https://github.com/cvlab-stonybrook/PLM_SSL.

QM · Oct 5, 2022
Novel Radiomic Measurements of Tumor-Associated Vasculature Morphology on Clinical Imaging as a Biomarker of Treatment Response in Multiple Cancers

Nathaniel Braman, Prateek Prasanna, Kaustav Bera et al.

Purpose: Tumor-associated vasculature differs from healthy blood vessels by its chaotic architecture and twistedness, which promotes treatment resistance. Measurable differences in these attributes may help stratify patients by likely benefit of systemic therapy (e.g., chemotherapy). In this work, we present a new category of radiomic biomarkers called quantitative tumor-associated vasculature (QuanTAV) features, and demonstrate their ability to predict response and survival across multiple cancers, imaging modalities, and treatment regimens. Experimental Design: We segmented tumor vessels and computed mathematical measurements of twistedness and organization on routine pre-treatment radiology (CT or contrast-enhanced MRI) from 558 patients, who received one of four first-line chemotherapy-based therapeutic intervention strategies for breast (n=371) or non-small cell lung cancer (NSCLC, n=187). Results: Across 4 chemotherapy-based treatment strategies, classifiers of QuanTAV measurements significantly (p<0.05) predicted response in held-out testing cohorts alone (AUC=0.63-0.71) and increased AUC by 0.06-0.12 when added to models of significant clinical variables alone. QuanTAV risk scores were prognostic of recurrence-free survival in the treatment cohorts receiving chemotherapy for breast cancer (p=0.002, HR=1.25, 95% CI 1.08-1.44, C-index=0.66) and chemoradiation for NSCLC (p=0.039, HR=1.28, 95% CI 1.01-1.62, C-index=0.66). Categorical QuanTAV risk groups were independently prognostic among all treatment groups, including NSCLC patients receiving chemotherapy (p=0.034, HR=2.29, 95% CI 1.07-4.94, C-index=0.62). Conclusions: Across these domains, we observed an association of vascular morphology on radiology with treatment outcome. Our findings suggest the potential of tumor-associated vasculature shape and structure as a prognostic and predictive biomarker for multiple cancers and treatments.

IV · Jul 12, 2023
SAM-Path: A Segment Anything Model for Semantic Segmentation in Digital Pathology

Jingwei Zhang, Ke Ma, Saarthak Kapse et al.

Semantic segmentations of pathological entities have crucial clinical value in computational pathology workflows. Foundation models, such as the Segment Anything Model (SAM), have been recently proposed for universal use in segmentation tasks. SAM shows remarkable promise in instance segmentation on natural images. However, the applicability of SAM to computational pathology tasks is limited due to the following factors: (1) lack of comprehensive pathology datasets used in SAM training and (2) the design of SAM is not inherently optimized for semantic segmentation tasks. In this work, we adapt SAM for semantic segmentation by introducing trainable class prompts, followed by further enhancements through the incorporation of a pathology encoder, specifically a pathology foundation model. Our framework, SAM-Path, enhances SAM's ability to conduct semantic segmentation in digital pathology without human input prompts. Through experiments on two public pathology datasets, the BCSS and the CRAG datasets, we demonstrate that fine-tuning with trainable class prompts outperforms vanilla SAM with manual prompts and post-processing by 27.52% in Dice score and 71.63% in IOU. On these two datasets, the proposed additional pathology foundation model further achieves a relative improvement of 5.07% to 5.12% in Dice score and 4.50% to 8.48% in IOU.
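For readers comparing the Dice and IOU figures, the two overlap metrics can be computed as follows. This is a minimal sketch for binary masks flattened to lists of 0 and 1; the helper name `dice_and_iou` and the convention of returning 1.0 for two empty masks are assumptions.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks are empty; treat this as perfect agreement.
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single pair of masks the two are linked by Dice = 2·IoU / (1 + IoU), so a given change in IoU always maps to a smaller relative change in Dice.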

CV · Jun 14, 2022
Surgical Phase Recognition in Laparoscopic Cholecystectomy

Yunfan Li, Vinayak Shenoy, Prateek Prasanna et al.

Automatic recognition of surgical phases in surgical videos is a fundamental task in surgical workflow analysis. In this report, we propose a Transformer-based method that utilizes calibrated confidence scores for a 2-stage inference pipeline, which dynamically switches between a baseline model and a separately trained transition model depending on the calibrated confidence level. Our method outperforms the baseline model on the Cholec80 dataset, and can be applied to a variety of action segmentation methods.
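The confidence-based switching described here can be sketched as a simple rule. The abstract does not give the threshold or say which model handles the low-confidence case, so the 0.9 cutoff and the fallback to the transition model are assumptions, as is the helper name `two_stage_predict`.

```python
from typing import Sequence


def two_stage_predict(
    baseline_probs: Sequence[float],
    transition_probs: Sequence[float],
    threshold: float = 0.9,
) -> int:
    """Pick a phase index, trusting the baseline only when it is confident enough."""
    confidence = max(baseline_probs)  # assumed to be already calibrated
    chosen = baseline_probs if confidence >= threshold else transition_probs
    return max(range(len(chosen)), key=lambda i: chosen[i])


# Confident baseline: its own prediction (phase 0) is used.
confident = two_stage_predict([0.95, 0.03, 0.02], [0.10, 0.80, 0.10])
# Uncertain baseline: the transition model's prediction (phase 1) is used.
uncertain = two_stage_predict([0.50, 0.30, 0.20], [0.10, 0.80, 0.10])
```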

CV · Mar 11, 2023
Token Sparsification for Faster Medical Image Segmentation

Lei Zhou, Huidong Liu, Joseph Bae et al.

Can we use sparse tokens for dense prediction, e.g., segmentation? Although token sparsification has been applied to Vision Transformers (ViT) to accelerate classification, it is still unknown how to perform segmentation from sparse tokens. To this end, we reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline. We first empirically show that naively applying existing approaches from classification token pruning and masked image modeling (MIM) leads to failure and inefficient training caused by inappropriate sampling algorithms and the low quality of the restored dense features. In this paper, we propose Soft-topK Token Pruning (STP) and Multi-layer Token Assembly (MTA) to address these problems. In sparse encoding, STP predicts token importance scores with a lightweight sub-network and samples the topK tokens. The intractable topK gradients are approximated through a continuous perturbed score distribution. In token completion, MTA restores a full token sequence by assembling both sparse output tokens and pruned multi-layer intermediate ones. The last dense decoding stage is compatible with existing segmentation decoders, e.g., UNETR. Experiments show SCD pipelines equipped with STP and MTA are much faster than baselines without token pruning in both training (up to 120% higher throughput) and inference (up to 60.6% higher throughput) while maintaining segmentation quality.
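The selection step at the heart of token pruning can be sketched as a plain hard topK over importance scores. This deliberately omits the perturbed-score gradient approximation that makes the paper's STP trainable; `keep_top_k_tokens` is a hypothetical helper and the scores are made up.

```python
from typing import List, Sequence


def keep_top_k_tokens(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four tokens with importance scores; keep the two most important ones.
kept = keep_top_k_tokens([0.1, 0.9, 0.4, 0.7], k=2)
```

Returning the indices in sequence order keeps positional information available to a later completion stage that re-inserts the pruned tokens.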

CV · Apr 3, 2023
ViT-DAE: Transformer-driven Diffusion Autoencoder for Histopathology Image Analysis

Xuan Xu, Saarthak Kapse, Rajarsi Gupta et al.

Generative AI has received substantial attention in recent years due to its ability to synthesize data that closely resembles the original data source. While Generative Adversarial Networks (GANs) have provided innovative approaches for histopathological image analysis, they suffer from limitations such as mode collapse and overfitting in the discriminator. Recently, denoising diffusion models have demonstrated promising results in computer vision. These models exhibit superior stability during training, better distribution coverage, and produce high-quality diverse images. Additionally, they display a high degree of resilience to noise and perturbations, making them well-suited for use in digital pathology, where images commonly contain artifacts and exhibit significant variations in staining. In this paper, we present a novel approach, namely ViT-DAE, which integrates vision transformers (ViT) and diffusion autoencoders for high-quality histopathology image synthesis. This marks the first time that ViT has been introduced to diffusion autoencoders in computational pathology, allowing the model to better capture the complex and intricate details of histopathology images. We demonstrate the effectiveness of ViT-DAE on three publicly available datasets. Our approach outperforms recent GAN-based and vanilla DAE methods in generating realistic images.

CV · Feb 8, 2023
Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation

Aishik Konwer, Xiaoling Hu, Joseph Bae et al.

In medical vision, different imaging modalities provide complementary information. However, in practice, not all modalities may be available during inference or even training. Previous approaches, e.g., knowledge distillation or image synthesis, often assume the availability of full modalities for all patients during training; this is unrealistic and impractical due to the variability in data collection across sites. We propose a novel approach to learn enhanced modality-agnostic representations by employing a meta-learning strategy in training, even when only limited full modality samples are available. Meta-learning enhances partial modality representations to full modality representations by meta-training on partial modality data and meta-testing on limited full modality samples. Additionally, we co-supervise this feature enrichment by introducing an auxiliary adversarial learning branch. More specifically, a missing modality detector is used as a discriminator to mimic the full modality setting. Our segmentation framework significantly outperforms state-of-the-art brain tumor segmentation techniques in missing modality scenarios.

IV · Mar 2, 2022
Temporal Context Matters: Enhancing Single Image Prediction with Disease Progression Representations

Aishik Konwer, Xuan Xu, Joseph Bae et al.

Clinical outcome or severity prediction from medical images has largely focused on learning representations from single-timepoint or snapshot scans. It has been shown that disease progression can be better characterized by temporal imaging. We therefore hypothesized that outcome predictions can be improved by utilizing the disease progression information from sequential images. We present a deep learning approach that leverages temporal progression information to improve clinical outcome predictions from single-timepoint images. In our method, a self-attention based Temporal Convolutional Network (TCN) is used to learn a representation that is most reflective of the disease trajectory. Meanwhile, a Vision Transformer is pretrained in a self-supervised fashion to extract features from single-timepoint images. The key contribution is to design a recalibration module that employs maximum mean discrepancy loss (MMD) to align distributions of the above two contextual representations. We train our system to predict clinical outcomes and severity grades from single-timepoint images. Experiments on chest and osteoarthritis radiography datasets demonstrate that our approach outperforms other state-of-the-art techniques.
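The maximum mean discrepancy used to align the two representations can be estimated directly from samples. The sketch below uses the biased estimator of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `sigma`, and the function names are assumptions, since the abstract does not specify them.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd_squared(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]], sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two sets of feature vectors."""

    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, sigma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


temporal_feats = [[0.0], [1.0]]
image_feats_far = [[5.0], [6.0]]
same = mmd_squared(temporal_feats, temporal_feats)
far = mmd_squared(temporal_feats, image_feats_far)
```

The value is zero when the two sample sets coincide and grows as the distributions move apart, which is what makes it usable as an alignment loss.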

CV · Sep 12, 2023
Attention De-sparsification Matters: Inducing Diversity in Digital Pathology Representation Learning

Saarthak Kapse, Srijan Das, Jingwei Zhang et al.

We propose DiRL, a Diversity-inducing Representation Learning technique for histopathology imaging. Self-supervised learning techniques, such as contrastive and non-contrastive approaches, have been shown to learn rich and effective representations of digitized tissue samples with limited pathologist supervision. Our analysis of vanilla SSL-pretrained models' attention distribution reveals an insightful observation: sparsity in attention, i.e., models tend to localize most of their attention to some prominent patterns in the image. Although attention sparsity can be beneficial in natural images due to these prominent patterns being the object of interest itself, this can be sub-optimal in digital pathology; this is because, unlike natural images, digital pathology scans are not object-centric, but rather a complex phenotype of various spatially intermixed biological components. Inadequate diversification of attention in these complex images could result in crucial information loss. To address this, we leverage cell segmentation to densely extract multiple histopathology-specific representations, and then propose a prior-guided dense pretext task for SSL, designed to match the multiple corresponding representations between the views. Through this, the model learns to attend to various components more closely and evenly, thus inducing adequate diversification in attention for capturing context-rich representations. Through quantitative and qualitative analysis on multiple tasks across cancer types, we demonstrate the efficacy of our method and observe that the attention is more globally distributed.

CV · Mar 28, 2022
CD-Net: Histopathology Representation Learning using Pyramidal Context-Detail Network

Saarthak Kapse, Srijan Das, Prateek Prasanna

Extracting rich phenotype information, such as cell density and arrangement, from whole slide histology images (WSIs) requires analysis of a large field of view, i.e., more contextual information. This can be achieved by analyzing the digital slides at lower resolution. A potential drawback is missing out on details present at a higher resolution. To jointly leverage complementary information from multiple resolutions, we present a novel transformer-based Pyramidal Context-Detail Network (CD-Net). CD-Net exploits the WSI pyramidal structure through co-training of proposed Context and Detail Modules, which operate on inputs from multiple resolutions. The residual connections between the modules enable the joint training paradigm while learning self-supervised representation for WSIs. The efficacy of CD-Net is demonstrated in classifying Lung Adenocarcinoma from Squamous cell carcinoma.

IV · Jul 21, 2023
Learning to Segment from Noisy Annotations: A Spatial Correction Approach

Jiachen Yao, Yikai Zhang, Songzhu Zheng et al.

Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demand in annotation time and in the annotators' expertise. Existing methods mostly assume noisy labels in different pixels are i.i.d. However, segmentation label noise usually has strong spatial correlation and has prominent bias in distribution. In this paper, we propose a novel Markov model for noisy segmentation annotations that encodes both spatial correlation and bias. Further, to mitigate such label noise, we propose a label correction method to progressively recover the true labels. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.

CV · Sep 13, 2023
Automated Assessment of Critical View of Safety in Laparoscopic Cholecystectomy

Yunfan Li, Himanshu Gupta, Haibin Ling et al.

Cholecystectomy (gallbladder removal) is one of the most common procedures in the US, with more than 1.2M procedures annually. Compared with classical open cholecystectomy, laparoscopic cholecystectomy (LC) is associated with significantly shorter recovery period, and hence is the preferred method. However, LC is also associated with an increase in bile duct injuries (BDIs), resulting in significant morbidity and mortality. The primary cause of BDIs from LCs is misidentification of the cystic duct with the bile duct. Critical view of safety (CVS) is the most effective of safety protocols, which is said to be achieved during the surgery if certain criteria are met. However, due to suboptimal understanding and implementation of CVS, the BDI rates have remained stable over the last three decades. In this paper, we develop deep-learning techniques to automate the assessment of CVS in LCs. An innovative aspect of our research is on developing specialized learning techniques by incorporating domain knowledge to compensate for the limited training data available in practice. In particular, our CVS assessment process involves a fusion of two segmentation maps followed by an estimation of a certain region of interest based on anatomical structures close to the gallbladder, and then finally determination of each of the three CVS criteria via rule-based assessment of structural information. We achieved a gain of over 11.8% in mIoU on relevant classes with our two-stream semantic segmentation approach when compared to a single-model baseline, and 1.84% in mIoU with our proposed Sobel loss function when compared to a Transformer-based baseline model. For CVS criteria, we achieved up to 16% improvement and, for the overall CVS assessment, we achieved 5% improvement in balanced accuracy compared to DeepCVS under the same experiment settings.
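The abstract names a Sobel loss function without defining it. As background only, the Sobel operator that such a loss presumably builds on can be sketched as a 3x3 filter that responds to intensity edges. The helper name `sobel_magnitude` is hypothetical, the filter is applied to interior pixels only, and nothing here should be read as the paper's loss.

```python
import math
from typing import List

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to horizontal intensity change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to vertical intensity change


def sobel_magnitude(image: List[List[float]]) -> List[List[float]]:
    """Gradient magnitude at every interior pixel (the output loses a 1-pixel border)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            gx = gy = 0.0
            # Correlate the 3x3 window centred at (r, c) with both kernels.
            for i in range(3):
                for j in range(3):
                    pixel = image[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * pixel
                    gy += SOBEL_Y[i][j] * pixel
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# A vertical step edge: dark on the left, bright on the right.
step_edge = [[0, 0, 1, 1]] * 3
edges = sobel_magnitude(step_edge)
```

One plausible use, stated here as a guess, is to compare edge maps of the predicted and reference segmentations so that boundary errors are penalized more heavily.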

23.7 · CV · May 1
Semantic Context-aware mOdality fUsion Transformer (SCOUT): A Context-Aware Multimodal Transformer for Concept-Grounded Pathology Report Generation

Suryakant Singh, Saarthak Kapse, Joel Saltz et al.

Whole-slide images (WSIs) present a fundamental challenge for computational pathology due to their extreme resolution, multi-scale heterogeneity, and the requirement for clinically reliable interpretation. Although recent pathology foundation models have enabled fluent report generation, they often lack clinical grounding, failing to accurately represent key diagnostic concepts and relationships observed by pathologists. This limitation arises from the difficulty of integrating heterogeneous visual evidence spanning fine-grained cellular patterns, slide-level tissue architecture, and high-level diagnostic concepts, while maintaining interpretability and clinical coherence. Here we present SCOUT: Semantic Context-aware mOdality fUsion Transformer, a context-aware concept-grounded multimodal framework for pathology report generation that enables progressive conditioning of image representations by global slide information and explicit diagnostic concepts. The method integrates local histological patterns, whole-slide context, and expert-curated semantic descriptors within a unified learning paradigm, allowing visual features to be dynamically refined throughout the encoding process. By combining depth-aware contextual modulation with adaptive multimodal fusion during text generation, the framework produces clinically coherent reports while preserving complementarity across representational scales. Using CONCH1.5 features, we evaluate SCOUT against WSI-Caption, HistGen, and BiGen on TCGA-BRCA, MICCAI REG, and HistAI. SCOUT achieves the best BLEU-1 to BLEU-4 and METEOR scores on all datasets, plus the best ROUGE-L on TCGA-BRCA and MICCAI REG. On TCGA-BRCA, it reaches 0.436/0.303/0.202/0.156 BLEU-1/2/3/4 and 0.204 METEOR; on REG 2025, it achieves 0.865/0.834/0.805/0.780 and 0.568. These results support progressive contextual conditioning for grounded pathology report generation.

IV · Aug 27, 2024
Histo-Diffusion: A Diffusion Super-Resolution Method for Digital Pathology with Comprehensive Quality Assessment

Xuan Xu, Saarthak Kapse, Prateek Prasanna

Digital pathology has advanced significantly over the last decade, with Whole Slide Images (WSIs) encompassing vast amounts of data essential for accurate disease diagnosis. High-resolution WSIs are essential for precise diagnosis, but technical limitations in scanning equipment and variability in slide preparation can hinder obtaining these images. Super-resolution techniques can enhance low-resolution images; while Generative Adversarial Networks (GANs) have been effective in natural image super-resolution tasks, they often struggle with histopathology due to overfitting and mode collapse. Traditional evaluation metrics fall short in assessing the complex characteristics of histopathology images, necessitating robust histology-specific evaluation methods. We introduce Histo-Diffusion, a novel diffusion-based method specially designed for generating and evaluating super-resolution images in digital pathology. It includes a restoration module for histopathology prior and a controllable diffusion module for generating high-quality images. We have curated two histopathology datasets and proposed a comprehensive evaluation strategy which incorporates both full-reference and no-reference metrics to thoroughly assess the quality of digital pathology images. Comparative analyses on multiple datasets with state-of-the-art methods reveal that Histo-Diffusion outperforms GANs. Our method offers a versatile solution for histopathology image super-resolution, capable of handling multi-resolution generation from varied input sizes, providing valuable support in diagnostic processes.

CV · Dec 24, 2025
TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Varun Belagali, Saarthak Kapse, Pierre Marza et al.

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

IV · Dec 8, 2024 · Code
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model

Meilong Xu, Saumya Gupta, Xiaoling Hu et al.

Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Frechet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships. Code is available at https://github.com/Melon-Xu/TopoCellGen.

CVApr 1, 2025Code
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology

Saarthak Kapse, Pushpak Pati, Srikar Yellapragada et al.

Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise. Code is made available at https://github.com/bmi-imaginelab/GECKO
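
The concept prior and contrastive alignment described above can be sketched roughly as follows. This is a minimal illustration, not the released GECKO code: the function names (`concept_prior`, `mean_pool`, `info_nce`) are hypothetical, mean pooling stands in for the learned MIL aggregators, and the two embeddings passed to the loss are assumed to have already been projected into a shared space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_prior(patch_embs, concept_embs):
    """One row per patch: its similarity to every concept text embedding."""
    return [[cosine(p, c) for c in concept_embs] for p in patch_embs]

def mean_pool(rows):
    """Stand-in aggregator (assumption): average the rows column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def info_nce(xs, ys, temperature=0.1):
    """One-directional InfoNCE: xs[i] should match ys[i] better than any other ys[j]."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        logits = [cosine(xs[i], ys[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n
```

In this sketch, each row of `concept_prior(...)` is directly readable as "how strongly this patch expresses each concept", which is the kind of interpretability the abstract attributes to the concept branch.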

CVAug 13, 2025Code
GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs

Moinak Bhattacharya, Gagandeep Singh, Shubham Jain et al.

In this work, we present GazeLT, a human visual attention integration-disintegration approach for long-tailed disease classification. A radiologist's eye gaze has distinct patterns that capture both fine-grained and coarser-level disease-related information. While interpreting an image, a radiologist's attention varies throughout the duration; it is critical to incorporate this into a deep learning framework to improve automated image interpretation. Another important aspect of visual attention is that apart from looking at major/obvious disease patterns, experts also look at minor/incidental findings (a few of which constitute long-tailed classes) during the course of image interpretation. GazeLT harnesses the temporal aspect of the visual search process, via an integration and disintegration mechanism, to improve long-tailed disease classification. We show the efficacy of GazeLT on two publicly available datasets for long-tailed disease classification, namely the NIH-CXR-LT (n=89237) and the MIMIC-CXR-LT (n=111898) datasets. GazeLT outperforms the best long-tailed loss by 4.1% and the visual attention-based baseline by 21.7% in average accuracy metrics for these datasets. Our code is available at https://github.com/lordmoinak1/gazelt.

IVMar 14, 2025Code
Pathology Image Compression with Pre-trained Autoencoders

Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis et al.

The growing volume of high-resolution Whole Slide Images in digital histopathology poses significant storage, transmission, and computational efficiency challenges. Standard compression methods, such as JPEG, reduce file sizes but often fail to preserve fine-grained phenotypic details critical for downstream tasks. In this work, we repurpose autoencoders (AEs) designed for Latent Diffusion Models as an efficient learned compression framework for pathology images. We systematically benchmark three AE models with varying compression levels and evaluate their reconstruction ability using pathology foundation models. We introduce a fine-tuning strategy to further enhance reconstruction fidelity that optimizes a pathology-specific learned perceptual metric. We validate our approach on downstream tasks, including segmentation, patch classification, and multiple instance learning, showing that replacing images with AE-compressed reconstructions leads to minimal performance degradation. Additionally, we propose a K-means clustering-based quantization method for AE latents, improving storage efficiency while maintaining reconstruction quality. We provide the weights of the fine-tuned autoencoders at https://huggingface.co/collections/StonyBrook-CVLab/pathology-fine-tuned-aes-67d45f223a659ff2e3402dd0.
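
The abstract does not spell out how the K-means quantization is applied to the latents. As a rough sketch of the general idea, assuming scalar latent values are clustered into `k` centroids and each value is then stored as a small integer index into that codebook (all function names here are hypothetical):

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))

def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars with a deterministic, evenly spread init."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # Keep the old centroid when a cluster receives no values.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def quantize(values, centroids):
    """Replace each latent value by the index of its nearest centroid."""
    return [nearest(v, centroids) for v in values]

def dequantize(codes, centroids):
    """Recover approximate latent values from codebook indices."""
    return [centroids[c] for c in codes]
```

With `k` centroids, each stored index needs about `log2(k)` bits instead of a full floating-point value, which is where the storage saving would come from under these assumptions.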

CVFeb 23, 2022Code
RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification

Moinak Bhattacharya, Shubham Jain, Prateek Prasanna

In this work, we present RadioTransformer, a novel visual attention-driven transformer framework, that leverages radiologists' gaze patterns and models their visuo-cognitive behavior for disease diagnosis on chest radiographs. Domain experts, such as radiologists, rely on visual information for medical image interpretation. On the other hand, deep neural networks have demonstrated significant promise in similar tasks even where visual interpretation is challenging. Eye-gaze tracking has been used to capture the viewing behavior of domain experts, lending insights into the complexity of visual search. However, deep learning frameworks, even those that rely on attention mechanisms, do not leverage this rich domain information. RadioTransformer fills this critical gap by learning from radiologists' visual search patterns, encoded as 'human visual attention regions' in a cascaded global-focal transformer framework. The overall 'global' image characteristics and the more detailed 'local' features are captured by the proposed global and focal modules, respectively. We experimentally validate the efficacy of our student-teacher approach for 8 datasets involving different disease classification tasks where eye-gaze data is not available during the inference phase. Code: https://github.com/bmi-imaginelab/radiotransformer.

CVDec 12, 2023
Learned representation-guided diffusion models for large-image generation

Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le et al.

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.

CVDec 22, 2023
SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology

Saarthak Kapse, Pushpak Pati, Srijan Das et al.

Introducing interpretability and reasoning into Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) analysis is challenging, given the complexity of gigapixel slides. Traditionally, MIL interpretability is limited to identifying salient regions deemed pertinent for downstream tasks, offering little insight to the end-user (pathologist) regarding the rationale behind these selections. To address this, we propose Self-Interpretable MIL (SI-MIL), a method intrinsically designed for interpretability from the very outset. SI-MIL employs a deep MIL framework to guide an interpretable branch grounded on handcrafted pathological features, facilitating linear predictions. Beyond identifying salient regions, SI-MIL uniquely provides feature-level interpretations rooted in pathological insights for WSIs. Notably, SI-MIL, with its linear prediction constraints, challenges the prevalent myth of an inevitable trade-off between model interpretability and performance, demonstrating competitive results compared to state-of-the-art methods on WSI-level prediction tasks across three cancer types. In addition, we thoroughly benchmark the local and global interpretability of SI-MIL in terms of statistical analysis, a domain expert study, and desiderata of interpretability, namely, user-friendliness and faithfulness.

IVNov 5, 2024
TopoTxR: A topology-guided deep convolutional network for breast parenchyma learning on DCE-MRIs

Fan Wang, Zhilin Zou, Nicole Sakla et al.

Characterization of breast parenchyma in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is a challenging task owing to the complexity of underlying tissue structures. Existing quantitative approaches, like radiomics and deep learning models, lack explicit quantification of intricate and subtle parenchymal structures, including fibroglandular tissue. To address this, we propose a novel topological approach that explicitly extracts multi-scale topological structures to better approximate breast parenchymal structures, and then incorporates these structures into a deep-learning-based prediction model via an attention mechanism. Our topology-informed deep learning model, TopoTxR, leverages topology to provide enhanced insights into tissues critical for disease pathophysiology and treatment response. We empirically validate TopoTxR using the VICTRE phantom breast dataset, showing that the topological structures extracted by our model effectively approximate the breast parenchymal structures. We further demonstrate TopoTxR's efficacy in predicting response to neoadjuvant chemotherapy. Our qualitative and quantitative analyses suggest differential topological behavior of breast tissue in treatment-naïve imaging between patients who respond favorably to therapy, achieving pathological complete response (pCR), and those who do not. In a comparative analysis with several baselines on the publicly available I-SPY 1 dataset (N=161, including 47 patients with pCR and 114 without) and the Rutgers proprietary dataset (N=120, with 69 patients achieving pCR and 51 not), TopoTxR demonstrates a notable improvement, achieving a 2.6% increase in accuracy and a 4.6% enhancement in AUC compared to the state-of-the-art method.

CVNov 25, 2024
ZoomLDM: Latent Diffusion Model for multi-scale image generation

Srikar Yellapragada, Alexandros Graikos, Kostas Triaridis et al.

Diffusion models have revolutionized image generation, yet several challenges restrict their application to large-image domains, such as digital pathology and satellite imagery. Given that it is infeasible to directly train a model on 'whole' images from domains with potential gigapixel sizes, diffusion-based generative methods have focused on synthesizing small, fixed-size patches extracted from these images. However, generating small patches has limited applicability since patch-based models fail to capture the global structures and wider context of large images, which can be crucial for synthesizing (semantically) accurate samples. To overcome this limitation, we present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings and allows the diffusion model to synthesize images at different 'zoom' levels, i.e., fixed-size patches extracted from large images at varying scales. ZoomLDM synthesizes coherent histopathology images that remain contextually accurate and detailed at different zoom levels, achieving state-of-the-art image generation quality across all scales and excelling in the data-scarce setting of generating thumbnails of entire large images. The multi-scale nature of ZoomLDM unlocks additional capabilities in large image generation, enabling computationally tractable and globally coherent image synthesis up to 4096 × 4096 pixels and 4× super-resolution. Additionally, multi-scale features extracted from ZoomLDM are highly effective in multiple instance learning experiments.

CVMar 6, 2025
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation

Aishik Konwer, Zhijian Yang, Erhan Bas et al.

Foundational models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques for alleviating such limitations, such as active learning, are narrow in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.

IVMay 29, 2025
ImmunoDiff: A Diffusion Model for Immunotherapy Response Prediction in Lung Cancer

Moinak Bhattacharya, Judy Huang, Amna F. Sher et al.

Accurately predicting immunotherapy response in Non-Small Cell Lung Cancer (NSCLC) remains a critical unmet need. Existing radiomics and deep learning-based predictive models rely primarily on pre-treatment imaging to predict categorical response outcomes, limiting their ability to capture the complex morphological and textural transformations induced by immunotherapy. This study introduces ImmunoDiff, an anatomy-aware diffusion model designed to synthesize post-treatment CT scans from baseline imaging while incorporating clinically relevant constraints. The proposed framework integrates anatomical priors, specifically lobar and vascular structures, to enhance fidelity in CT synthesis. We also introduce a novel cbi-Adapter, a conditioning module that ensures pairwise-consistent multimodal integration of imaging and clinical data embeddings. A clinical variable conditioning mechanism, leveraging demographic data, blood-based biomarkers, and PD-L1 expression, further refines the generative process. Evaluations on an in-house NSCLC cohort treated with immune checkpoint inhibitors demonstrate a 21.24% improvement in balanced accuracy for response prediction and a 0.03 increase in c-index for survival prediction. Code will be released soon.

CVDec 2, 2024
Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Varun Belagali, Srikar Yellapragada, Alexandros Graikos et al.

Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these 'self-augmentations', i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.

IVApr 6, 2025
BrainMRDiff: A Diffusion Model for Anatomically Consistent Brain MRI Synthesis

Moinak Bhattacharya, Saumya Gupta, Annie Singh et al.

Accurate brain tumor diagnosis relies on the assessment of multiple Magnetic Resonance Imaging (MRI) sequences. However, in clinical practice, the acquisition of certain sequences may be affected by factors like motion artifacts or contrast agent contraindications, leading to suboptimal outcomes such as poor image quality. This can then affect image interpretation by radiologists. Synthesizing high quality MRI sequences has thus become a critical research focus. Though recent advancements in controllable generative AI have facilitated the synthesis of diagnostic quality MRI, ensuring anatomical accuracy remains a significant challenge. Preserving critical structural relationships between different anatomical regions is essential, as even minor structural or topological inconsistencies can compromise diagnostic validity. In this work, we propose BrainMRDiff, a novel topology-preserving, anatomy-guided diffusion model for synthesizing brain MRI, leveraging brain and tumor anatomies as conditioning inputs. To achieve this, we introduce two key modules: Tumor+Structure Aggregation (TSA) and Topology-Guided Anatomy Preservation (TGAP). TSA integrates diverse anatomical structures with tumor information, forming a comprehensive conditioning mechanism for the diffusion process. TGAP enforces topological consistency during the reverse denoising diffusion process. Together, these modules ensure that the generated image respects anatomical integrity. Experimental results demonstrate that BrainMRDiff surpasses existing baselines, achieving performance improvements of 23.33% on the BraTS-AG dataset and 33.33% on the BraTS-Met dataset. Code will be made publicly available soon.

IVNov 22, 2024
RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

Wentao Huang, Meilong Xu, Xiaoling Hu et al.

Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on seven public datasets that encompass gene expression prediction, slide-level classification, and survival analysis demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.

CVMar 4
Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

Xuan Xu, Prateek Prasanna

Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.

CVOct 3, 2025
PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology

Sejuti Majumder, Saarthak Kapse, Moinak Bhattacharya et al.

Integrating histopathology with spatial transcriptomics (ST) provides a powerful opportunity to link tissue morphology with molecular function. Yet most existing multimodal approaches rely on a small set of highly variable genes, which limits predictive scope and overlooks the coordinated biological programs that shape tissue phenotypes. We present PEaRL (Pathway Enhanced Representation Learning), a multimodal framework that represents transcriptomics through pathway activation scores computed with ssGSEA. By encoding biologically coherent pathway signals with a transformer and aligning them with histology features via contrastive learning, PEaRL reduces dimensionality, improves interpretability, and strengthens cross-modal correspondence. Across three cancer ST datasets (breast, skin, and lymph node), PEaRL consistently outperforms SOTA methods, yielding higher accuracy for both gene- and pathway-level expression prediction (up to 58.9 percent and 20.4 percent increase in Pearson correlation coefficient compared to SOTA). These results demonstrate that grounding transcriptomic representation in pathways produces more biologically faithful and interpretable multimodal models, advancing computational pathology beyond gene-level embeddings.
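
For readers unfamiliar with the reported metric, the Pearson correlation coefficient between predicted and measured expression can be computed as in the sketch below. The `relative_gain` helper is a hypothetical illustration of how a "percent increase" would be obtained if the reported figures are read as relative gains over a baseline, which the abstract does not state explicitly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def relative_gain(new, baseline):
    """Percent change of `new` relative to `baseline` (assumed interpretation)."""
    return 100.0 * (new - baseline) / baseline
```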

IVSep 29, 2025
Anatomy-DT: A Cross-Diffusion Digital Twin for Anatomical Evolution

Moinak Bhattacharya, Gagandeep Singh, Prateek Prasanna

Accurately modeling the spatiotemporal evolution of tumor morphology from baseline imaging is a prerequisite for developing digital twin frameworks that can simulate disease progression and treatment response. Most existing approaches primarily characterize tumor growth while neglecting the concomitant alterations in adjacent anatomical structures. In reality, tumor evolution is highly non-linear and heterogeneous, shaped not only by therapeutic interventions but also by its spatial context and interaction with neighboring tissues. Therefore, it is critical to model tumor progression in conjunction with surrounding anatomy to obtain a comprehensive and clinically relevant understanding of disease dynamics. We introduce a mathematically grounded framework that unites mechanistic partial differential equations with differentiable deep learning. Anatomy is represented as a multi-class probability field on the simplex and evolved by a cross-diffusion reaction-diffusion system that enforces inter-class competition and exclusivity. A differentiable implicit-explicit scheme treats stiff diffusion implicitly while handling nonlinear reaction and event terms explicitly, followed by projection back to the simplex. To further enhance global plausibility, we introduce a topology regularizer that simultaneously enforces centerline preservation and penalizes region overlaps. The approach is validated on synthetic datasets and a clinical dataset. On synthetic benchmarks, our method achieves state-of-the-art accuracy while preserving topology, and also demonstrates superior performance on the clinical dataset. By integrating PDE dynamics, topology-aware regularization, and differentiable solvers, this work establishes a principled path toward anatomy-to-anatomy generation for digital twins that are visually realistic, anatomically exclusive, and topologically consistent.
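
The "projection back to the simplex" step can be illustrated with the standard sort-based Euclidean projection onto the probability simplex. The abstract does not say which projection the authors use, so this is only one plausible choice, shown for a single voxel's class-probability vector; the function name is hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_i >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # The condition holds for a prefix of the sorted values; keep the last hit.
        if u_j - candidate > 0:
            theta = candidate
    # Shift every entry by theta and clip negatives to zero.
    return [max(x - theta, 0.0) for x in v]
```

After an update step pushes a voxel's class scores off the simplex (negative entries, or entries that no longer sum to one), applying this projection restores a valid probability vector while moving it as little as possible in the Euclidean sense.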

CVSep 29, 2025
SoC-DT: Standard-of-Care Aligned Digital Twins for Patient-Specific Tumor Dynamics

Moinak Bhattacharya, Gagandeep Singh, Prateek Prasanna

Accurate prediction of tumor trajectories under standard-of-care (SoC) therapies remains a major unmet need in oncology. This capability is essential for optimizing treatment planning and anticipating disease progression. Conventional reaction-diffusion models are limited in scope, as they fail to capture tumor dynamics under heterogeneous therapeutic paradigms. There is hence a critical need for computational frameworks that can realistically simulate SoC interventions while accounting for inter-patient variability in genomics, demographics, and treatment regimens. We introduce Standard-of-Care Digital Twin (SoC-DT), a differentiable framework that unifies reaction-diffusion tumor growth models, discrete SoC interventions (surgery, chemotherapy, radiotherapy) along with genomic and demographic personalization to predict post-treatment tumor structure on imaging. An implicit-explicit exponential time-differencing solver, IMEX-SoC, is also proposed, which ensures stability, positivity, and scalability in SoC treatment situations. Evaluated on both synthetic data and real world glioma data, SoC-DT consistently outperforms classical PDE baselines and purely data-driven neural models in predicting tumor dynamics. By bridging mechanistic interpretability with modern differentiable solvers, SoC-DT establishes a principled foundation for patient-specific digital twins in oncology, enabling biologically consistent tumor dynamics estimation. Code will be made available upon acceptance.
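
To make the implicit-explicit idea concrete, here is a minimal 1D sketch of a reaction-diffusion tumor model with logistic growth, where diffusion is treated implicitly (backward Euler) and the reaction term explicitly. It is a simplified stand-in, not the exponential time-differencing IMEX-SoC solver proposed in the paper; the zero-flux boundaries, the function names, and the `apply_kill` treatment event are all assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[0] and upper[-1] are unused."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def imex_step(u, dt, dx, diffusion, growth):
    """One step of du/dt = D u_xx + rho u (1 - u) on a grid with at least 2 cells."""
    n = len(u)
    a = dt * diffusion / (dx * dx)
    # Explicit part: logistic growth evaluated at the current state.
    rhs = [ui + dt * growth * ui * (1.0 - ui) for ui in u]
    # Implicit part: (I - dt * D * Laplacian) with zero-flux boundaries.
    lower = [0.0] + [-a] * (n - 1)
    upper = [-a] * (n - 1) + [0.0]
    diag = [1.0 + a] + [1.0 + 2.0 * a] * (n - 2) + [1.0 + a]
    return thomas_solve(lower, diag, upper, rhs)

def apply_kill(u, fraction):
    """Hypothetical discrete treatment event: remove a fraction of tumor density."""
    return [ui * (1.0 - fraction) for ui in u]
```

Treating diffusion implicitly keeps the step stable for time steps that an explicit scheme could not tolerate, and with growth switched off this particular discretization conserves total tumor mass and keeps densities non-negative.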

CVSep 18, 2025
NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training

Moinak Bhattacharya, Angelica P. Kurtz, Fabio M. Iwamoto et al.

Neuro-oncology poses unique challenges for machine learning due to heterogeneous data and tumor complexity, limiting the ability of foundation models (FMs) to generalize across cohorts. Existing FMs also perform poorly in predicting uncommon molecular markers, which are essential for treatment response and risk stratification. To address these gaps, we developed a neuro-oncology specific FM with a distributionally robust loss function, enabling accurate estimation of tumor phenotypes while maintaining cross-institution generalization. We pretrained self-supervised backbones (BYOL, DINO, MAE, MoCo) on multi-institutional brain tumor MRI and applied distributionally robust optimization (DRO) to mitigate site and class imbalance. Downstream tasks included molecular classification of common markers (MGMT, IDH1, 1p/19q, EGFR), uncommon alterations (ATRX, TP53, CDKN2A/2B, TERT), continuous markers (Ki-67, TP53), and overall survival prediction in IDH1 wild-type glioblastoma at UCSF, UPenn, and CUIMC. Our method improved molecular prediction and reduced site-specific embedding differences. At CUIMC, mean balanced accuracy rose from 0.744 to 0.785 and AUC from 0.656 to 0.676, with the largest gains for underrepresented endpoints (CDKN2A/2B accuracy 0.86 to 0.92, AUC 0.73 to 0.92; ATRX AUC 0.69 to 0.82; Ki-67 accuracy 0.60 to 0.69). For survival, c-index improved at all sites: CUIMC 0.592 to 0.597, UPenn 0.647 to 0.672, UCSF 0.600 to 0.627. Grad-CAM highlighted tumor and peri-tumoral regions, confirming interpretability. Overall, coupling FMs with DRO yields more site-invariant representations, improves prediction of common and uncommon markers, and enhances survival discrimination, underscoring the need for prospective validation and integration of longitudinal and interventional signals to advance precision neuro-oncology.
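
The abstract does not state which DRO formulation is used. One common variant is group DRO, where each group (here, hypothetically, an acquisition site) carries a weight that grows when that group's loss is high, so training focuses on the worst-performing groups. The sketch below illustrates that variant only; the function names and the choice of sites as groups are assumptions.

```python
import math

def group_losses(losses, groups):
    """Average the per-sample losses within each group label."""
    sums, counts = {}, {}
    for loss, g in zip(losses, groups):
        sums[g] = sums.get(g, 0.0) + loss
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def update_group_weights(weights, per_group, step_size=1.0):
    """Exponentiated-gradient update: up-weight groups with higher loss."""
    raw = {g: w * math.exp(step_size * per_group.get(g, 0.0))
           for g, w in weights.items()}
    total = sum(raw.values())
    return {g: r / total for g, r in raw.items()}

def robust_loss(weights, per_group):
    """Weighted sum of group losses; every group in per_group must have a weight."""
    return sum(weights[g] * per_group[g] for g in per_group)
```

A usage example under the same assumptions: start from uniform weights over sites, call `update_group_weights` once per batch, and backpropagate `robust_loss` instead of the plain batch mean.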

IVJun 5, 2025
PixCell: A generative foundation model for digital histopathology images

Srikar Yellapragada, Alexandros Graikos, Zilinghan Li et al.

The digitization of histology slides has revolutionized pathology, providing massive datasets for cancer diagnosis and research. Contrastive self-supervised and vision-language models have been shown to effectively mine large pathology datasets to learn discriminative representations. On the other hand, generative models, capable of synthesizing realistic and diverse images, present a compelling solution to address unique problems in pathology that involve synthesizing images: overcoming annotated data scarcity, enabling privacy-preserving data sharing, and performing inherently generative tasks, such as virtual staining. We introduce PixCell, the first diffusion-based generative foundation model for histopathology. We train PixCell on PanCan-30M, a vast, diverse dataset derived from 69,184 H&E-stained whole slide images covering various cancer types. We employ a progressive training strategy and a self-supervision-based conditioning that allows us to scale up training without any annotated data. PixCell generates diverse and high-quality images across multiple cancer types, which we find can be used in place of real data to train a self-supervised discriminative model. Synthetic images shared between institutions are subject to fewer regulatory barriers than would be the case with real clinical images. Furthermore, we showcase the ability to precisely control image generation using a small set of annotated images, which can be used for both data augmentation and educational purposes. Testing on a cell segmentation task, a mask-guided PixCell enables targeted data augmentation, improving downstream performance. Finally, we demonstrate PixCell's ability to use H&E structural staining to infer results from molecular marker studies; we use this capability to infer IHC staining from H&E images. Our trained models are publicly released to accelerate research in computational pathology.

CVSep 1, 2023
PathLDM: Text conditioned Latent Diffusion Model for Histopathology

Srikar Yellapragada, Alexandros Graikos, Prateek Prasanna et al.

To achieve high-quality results, diffusion models must be trained on large datasets. This can be notably prohibitive for models in specialized domains, such as computational pathology. Conditioning on labeled data is known to help in data-efficient model training. Therefore, histopathology reports, which are rich in valuable clinical information, are an ideal choice as guidance for a histopathology generative model. In this paper, we introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Leveraging the rich contextual information provided by pathology text reports, our approach fuses image and textual data to enhance the generation process. By utilizing GPT's capabilities to distill and summarize complex text reports, we establish an effective conditioning mechanism. Through strategic conditioning and necessary architectural enhancements, we achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.

IVFeb 3, 2022
Brain Cancer Survival Prediction on Treatment-naïve MRI using Deep Anchor Attention Learning with Vision Transformer

Xuan Xu, Prateek Prasanna

Image-based brain cancer prediction models, based on radiomics, quantify the radiologic phenotype from magnetic resonance imaging (MRI). However, these features are difficult to reproduce because of variability in acquisition and preprocessing pipelines. Despite evidence of intra-tumor phenotypic heterogeneity, the spatial diversity between different slices within an MRI scan has been relatively unexplored using such methods. In this work, we propose a deep anchor attention aggregation strategy with a Vision Transformer to predict survival risk for brain cancer patients. A Deep Anchor Attention Learning (DAAL) algorithm is proposed to assign different weights to slice-level representations with trainable distance measurements. We evaluated our method on N = 326 MRIs. Our results outperformed attention multiple instance learning-based techniques. DAAL highlights the importance of critical slices and corroborates the clinical intuition that inter-slice spatial diversity can reflect disease severity and is implicated in outcome.
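The abstract does not spell out how DAAL computes its slice weights, so the following is only an illustrative sketch of one way distance-based attention pooling over slice-level features can work. The function name `anchor_attention_pool`, the fixed `anchor` vector, the Euclidean distance, and the `temperature` parameter are all assumptions made for illustration; in a trained model the anchor and the distance measure would be learned rather than supplied by hand.

```python
import math


def anchor_attention_pool(slice_feats, anchor, temperature=1.0):
    """Pool slice-level feature vectors into a single scan-level vector.

    Hypothetical sketch: each slice is weighted by a softmax over its
    negative Euclidean distance to an anchor vector, so slices closer
    to the anchor contribute more to the pooled representation.
    """
    # Distance of every slice representation to the anchor.
    dists = [math.dist(feat, anchor) for feat in slice_feats]
    # Smaller distance -> larger logit.
    logits = [-d / temperature for d in dists]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the slice features, one dimension at a time.
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, slice_feats))
        for i in range(len(anchor))
    ]
    return weights, pooled


# Toy example: three 2-D "slice embeddings" and an anchor equal to the second.
weights, pooled = anchor_attention_pool(
    [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], anchor=[1.0, 1.0]
)
```

In this toy call the second slice coincides with the anchor and therefore receives the largest weight, which is the sense in which such a scheme can highlight "critical" slices.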

IVJan 18, 2022
Lung Swapping Autoencoder: Learning a Disentangled Structure-texture Representation of Chest Radiographs

Lei Zhou, Joseph Bae, Huidong Liu et al.

Well-labeled datasets of chest radiographs (CXRs) are difficult to acquire due to the high cost of annotation. Thus, it is desirable to learn a robust and transferable representation in an unsupervised manner to benefit tasks that lack labeled data. Unlike natural images, medical images have their own domain prior; e.g., we observe that many pulmonary diseases, such as COVID-19, manifest as changes in the lung tissue texture rather than the anatomical structure. Therefore, we hypothesize that studying only the texture without the influence of structure variations would be advantageous for downstream prognostic and predictive modeling tasks. In this paper, we propose a generative framework, the Lung Swapping Autoencoder (LSAE), that learns factorized representations of a CXR to disentangle the texture factor from the structure factor. Specifically, by adversarial training, the LSAE is optimized to generate a hybrid image that preserves the lung shape in one image but inherits the lung texture of another. To demonstrate the effectiveness of the disentangled texture representation, we evaluate the texture encoder $Enc^t$ in LSAE on ChestX-ray14 (N=112,120), and our own multi-institutional COVID-19 outcome prediction dataset, COVOC (N=340 (Subset-1) + 53 (Subset-2)). On both datasets, we reach or surpass the state-of-the-art by finetuning $Enc^t$ in LSAE that is 77% smaller than a baseline Inception v3. Additionally, in semi- and self-supervised settings with a similar model budget, $Enc^t$ in LSAE is also competitive with the state-of-the-art MoCo. By "re-mixing" the texture and shape factors, we generate meaningful hybrid images that can augment the training set. This data augmentation method can further improve COVOC prediction performance. The improvement is consistent even when we directly evaluate the Subset-1 trained model on Subset-2 without any fine-tuning.

IVJul 18, 2021
Attention-based Multi-scale Gated Recurrent Encoder with Novel Correlation Loss for COVID-19 Progression Prediction

Aishik Konwer, Joseph Bae, Gagandeep Singh et al.

COVID-19 image analysis has mostly focused on diagnostic tasks using single timepoint scans acquired upon disease presentation or admission. We present a deep learning-based approach to predict lung infiltrate progression from serial chest radiographs (CXRs) of COVID-19 patients. Our method first utilizes convolutional neural networks (CNNs) for feature extraction from patches within the concerned lung zone, and also from neighboring and remote boundary regions. The framework further incorporates a multi-scale Gated Recurrent Unit (GRU) with a correlation module for effective predictions. The GRU accepts CNN feature vectors from three different areas as input and generates a fused representation. The correlation module attempts to minimize the correlation loss between hidden representations of the concerned and neighboring area feature vectors, while maximizing that loss between hidden representations of the concerned and remote regions. Further, we employ an attention module over the output hidden states of each encoder timepoint to generate a context vector. This vector is used as an input to a decoder module to predict patch severity grades at a future timepoint. Finally, we ensemble the patch classification scores to calculate patient-wise grades. Specifically, our framework predicts zone-wise disease severity for a patient on a given day by learning representations from the previous temporal CXRs. Our novel multi-institutional dataset comprises sequential CXR scans from N=93 patients. Our approach outperforms transfer learning and radiomic feature-based baseline approaches on this dataset.

IVJul 13, 2021
Attention based CNN-LSTM Network for Pulmonary Embolism Prediction on Chest Computed Tomography Pulmonary Angiograms

Sudhir Suman, Gagandeep Singh, Nicole Sakla et al.

With more than 60,000 deaths annually in the United States, Pulmonary Embolism (PE) is among the most fatal cardiovascular diseases. It is caused by an artery blockage in the lung; confirming its presence is time-consuming and is prone to over-diagnosis. The utilization of automated PE detection systems is critical for diagnostic accuracy and efficiency. In this study we propose a two-stage attention-based CNN-LSTM network for predicting PE, its associated type (chronic, acute) and corresponding location (left-sided, right-sided or central) on computed tomography (CT) examinations. We trained our model on the largest available public Computed Tomography Pulmonary Angiogram PE dataset (RSNA-STR Pulmonary Embolism CT (RSPECT) Dataset, N=7279 CT studies) and tested it on an in-house curated dataset of N=106 studies. Our framework mirrors the radiologic diagnostic process via a multi-slice approach so that the accuracy and pathologic sequela of true pulmonary emboli may be meticulously assessed, enabling physicians to better appraise the morbidity of a PE when present. Our proposed method outperformed a baseline CNN classifier and a single-stage CNN-LSTM network, achieving an AUC of 0.95 on the test set for detecting the presence of PE in the study.

QMMay 13, 2021
TopoTxR: A Topological Biomarker for Predicting Treatment Response in Breast Cancer

Fan Wang, Saarthak Kapse, Steven Liu et al.

Characterization of breast parenchyma on dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is a challenging task owing to the complexity of underlying tissue structures. Current quantitative approaches, including radiomics and deep learning models, do not explicitly capture the complex and subtle parenchymal structures, such as fibroglandular tissue. In this paper, we propose a novel method to direct a neural network's attention to a dedicated set of voxels surrounding biologically relevant tissue structures. By extracting multi-dimensional topological structures with high saliency, we build a topology-derived biomarker, TopoTxR. We demonstrate the efficacy of TopoTxR in predicting response to neoadjuvant chemotherapy in breast cancer. Our qualitative and quantitative results suggest differential topological behavior of breast tissue on treatment-naïve imaging, in patients who respond favorably to therapy versus those who do not.

IVMar 12, 2021
Radiomic Deformation and Textural Heterogeneity (R-DepTH) Descriptor to characterize Tumor Field Effect: Application to Survival Prediction in Glioblastoma

Marwa Ismail, Prateek Prasanna, Kaustav Bera et al.

The concept of tumor field effect implies that cancer is a systemic disease whose impact extends well beyond the visible tumor confines. For instance, in Glioblastoma (GBM), an aggressive brain tumor, the increase in intracranial pressure due to tumor burden often leads to brain herniation and poor outcomes. Our work is based on the rationale that highly aggressive tumors tend to grow uncontrollably, leading to pronounced biomechanical tissue deformations in the normal parenchyma, which when combined with local morphological differences in the tumor confines on MRI scans, will comprehensively capture tumor field effect. Specifically, we present an integrated MRI-based descriptor, radiomic-Deformation and Textural Heterogeneity (r-DepTH). This descriptor comprises measurements of the subtle perturbations in tissue deformations throughout the surrounding normal parenchyma due to mass effect. This involves non-rigidly aligning the patient's MRI scans to a healthy atlas via diffeomorphic registration. The resulting inverse mapping is used to obtain the deformation field magnitudes in the normal parenchyma. These measurements are then combined with a 3D texture descriptor, Co-occurrence of Local Anisotropic Gradient Orientations (COLLAGE), which captures the morphological heterogeneity within the tumor confines, on MRI scans. R-DepTH, on N = 207 GBM cases (training set (St) = 128, testing set (Sv) = 79), demonstrated improved prognosis of overall survival by categorizing patients into low- (prolonged survival) and high-risk (poor survival) groups (on St, p-value = 0.0000035, and on Sv, p-value = 0.0024). The R-DepTH descriptor may serve as a comprehensive MRI-based prognostic marker of disease aggressiveness and survival in solid tumors.
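As a small illustration of the "deformation field magnitudes" step, the sketch below computes the per-voxel Euclidean norm of a displacement field and keeps only voxels flagged as normal parenchyma. It assumes the registration has already produced one (dx, dy, dz) displacement per voxel; the function name `deformation_magnitudes`, the flat list layout, and the boolean mask are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def deformation_magnitudes(field, parenchyma_mask):
    """Return displacement magnitudes for voxels inside the mask.

    field: list of (dx, dy, dz) displacements, one per voxel (assumed to
        come from a prior registration step that is not shown here).
    parenchyma_mask: list of booleans, True for normal-parenchyma voxels.
    """
    return [
        math.sqrt(dx * dx + dy * dy + dz * dz)
        for (dx, dy, dz), keep in zip(field, parenchyma_mask)
        if keep
    ]


# Toy example with three voxels; the second lies outside the mask.
mags = deformation_magnitudes(
    [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0), (1.0, 2.0, 2.0)],
    [True, False, True],
)
# mags == [5.0, 3.0]
```

Summary statistics of these magnitudes (for example their mean or spread over the parenchyma) could then serve as features alongside a texture descriptor, although the exact statistics used by the authors are not stated in the abstract.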

LGFeb 11, 2021
EventScore: An Automated Real-time Early Warning Score for Clinical Events

Ibrahim Hammoud, Prateek Prasanna, IV Ramakrishnan et al.

Early prediction of patients at risk of clinical deterioration can help physicians intervene and alter their clinical course towards better outcomes. In addition to the accuracy requirement, early warning systems must make the predictions early enough to give physicians enough time to intervene. Interpretability is also one of the challenges when building such systems since being able to justify the reasoning behind model decisions is desirable in clinical practice. In this work, we built an interpretable model for the early prediction of various adverse clinical events indicative of clinical deterioration. The model is evaluated on two datasets and four clinical events. The first dataset is collected in a predominantly COVID-19 positive population at Stony Brook Hospital. The second dataset is the MIMIC III dataset. The model was trained to provide early warning scores for ventilation, ICU transfer, and mortality prediction tasks on the Stony Brook Hospital dataset and to predict mortality and the need for vasopressors on the MIMIC III dataset. Our model first separates each feature into multiple ranges and then uses logistic regression with lasso penalization to select the subset of ranges for each feature. The model training is completely automated and doesn't require expert knowledge like other early warning scores. We compare our model to the Modified Early Warning Score (MEWS) and quick SOFA (qSOFA), commonly used in hospitals. We show that our model outperforms these models in the area under the receiver operating characteristic curve (AUROC) while having a similar or better median detection time on all clinical events, even when using fewer features. Unlike MEWS and qSOFA, our model can be entirely automated without requiring any manually recorded features. We also show that discretization improves model performance by comparing our model to a baseline logistic regression model.
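The two-step recipe described above (split each feature into ranges, then let an L1-penalized logistic regression pick the ranges that matter) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin edges, the toy heart-rate data, the helper names, and the proximal-gradient solver with its hyperparameters are all invented for the example.

```python
import math


def one_hot_ranges(value, edges):
    """Encode a value as an indicator of the range it falls in.

    Edges [a, b] define three ranges: (-inf, a), [a, b), [b, inf).
    """
    idx = sum(value >= e for e in edges)
    return [1.0 if i == idx else 0.0 for i in range(len(edges) + 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(X, y, lam=0.01, lr=0.5, steps=2000):
    """Logistic regression with a lasso penalty via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error for every sample under the current parameters.
        errs = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - yi
            for x, yi in zip(X, y)
        ]
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(d)]
        grad_b = sum(errs) / n
        # Gradient step followed by soft-thresholding (intercept unpenalized).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def risk_score(value, edges, w, b):
    x = one_hot_ranges(value, edges)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy data: heart rate (hypothetical values) and whether an event followed.
EDGES = [60, 100]
heart_rates = [50, 55, 70, 80, 90, 95, 110, 120, 130]
events = [0, 1, 0, 0, 0, 0, 1, 1, 1]
X = [one_hot_ranges(hr, EDGES) for hr in heart_rates]
w, b = fit_l1_logistic(X, events)
```

Because every input is a range indicator, each learned weight reads directly as the contribution of "feature in this range" to the score, which is what makes this kind of model easy to inspect; weights that the penalty drives to exactly zero correspond to ranges that were not selected.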

QMJul 15, 2020
Predicting Clinical Outcomes in COVID-19 using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study

Joseph Bae, Saarthak Kapse, Gagandeep Singh et al.

We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning classifiers to predict mechanical ventilation requirement and mortality were trained and evaluated using patient CXRs. A novel radiomic embedding framework was also explored for outcome prediction. All results are compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic and DL classification models had mAUCs of 0.78+/-0.02 and 0.81+/-0.04, compared with expert scores mAUCs of 0.75+/-0.02 and 0.79+/-0.05 for mechanical ventilation requirement and mortality prediction, respectively. Combined classifiers using both radiomics and expert severity scores resulted in mAUCs of 0.79+/-0.04 and 0.83+/-0.04 for each prediction task, demonstrating improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances where inclusion of radiomic features in DL improves model predictions, something that might be explored in other pathologies. The models proposed in this study and the prognostic information they provide might aid physician decision making and resource allocation during the COVID-19 pandemic.

QMJun 17, 2020
Spatial-And-Context aware (SpACe) "virtual biopsy" radiogenomic maps to target tumor mutational status on structural MRI

Marwa Ismail, Ramon Correa, Kaustav Bera et al.

With growing emphasis on personalized cancer therapies, radiogenomics has shown promise in identifying target tumor mutational status on routine imaging (e.g., MRI) scans. These approaches fall into 2 categories: (1) deep-learning/radiomics (context-based), using image features from the entire tumor to identify the gene mutation status, or (2) atlas (spatial)-based to obtain likelihood of gene mutation status based on population statistics. While many genes (e.g., EGFR, MGMT) are spatially variant, a significant challenge in reliable assessment of gene mutation status on imaging has been the lack of available co-localized ground truth for training the models. We present Spatial-And-Context aware (SpACe) "virtual biopsy" maps that incorporate context-features from co-localized biopsy site along with spatial-priors from population atlases, within a Least Absolute Shrinkage and Selection Operator (LASSO) regression model, to obtain a per-voxel probability of the presence of a mutation status (M+ vs M-). We then use a probabilistic pair-wise Markov model to improve the voxel-wise prediction probability. We evaluate the efficacy of SpACe maps on MRI scans with co-localized ground truth obtained from corresponding biopsy, to predict the mutation status of 2 driver genes in Glioblastoma: (1) EGFR (n=91), and (2) MGMT (n=81). When compared against deep-learning (DL) and radiomic models, SpACe maps obtained training and testing accuracies of 90% (n=71) and 90.48% (n=21) in identifying EGFR amplification status, compared to 80% and 71.4% via radiomics, and 74.28% and 65.5% via DL. For MGMT status, training and testing accuracies using SpACe were 88.3% (n=61) and 71.5% (n=20), compared to 52.4% and 66.7% using radiomics, and 79.3% and 68.4% using DL. Following validation, SpACe maps could provide surgical navigation to improve localization of sampling sites for targeting of specific driver genes in cancer.