CVJun 3, 2022
Additive MIL: Intrinsically Interpretable Multiple Instance Learning for PathologySyed Ashar Javed, Dinkar Juyal, Harshith Padigela et al.
Multiple Instance Learning (MIL) has been widely applied in pathology towards solving critical problems such as automating cancer diagnosis and grading, predicting patient prognosis, and therapy response. Deploying these models in a clinical setting requires careful inspection of these black boxes during development and deployment to identify failures and maintain physician trust. In this work, we propose a simple formulation of MIL models, which enables interpretability while maintaining similar predictive performance. Our Additive MIL models enable spatial credit assignment such that the contribution of each region in the image can be exactly computed and visualized. We show that our spatial credit assignment coincides with regions used by pathologists during diagnosis and improves upon classical attention heatmaps from attention MIL models. We show that any existing MIL model can be made additive with a simple change in function composition. We also show how these models can debug model failures, identify spurious features, and highlight class-wise regions of interest, enabling their use in high-stakes environments such as clinical decision-making.
IVJul 15, 2024
Learning biologically relevant features in a pathology foundation model using sparse autoencodersNhat Minh Le, Ciyue Shen, Neel Patel et al.
Pathology plays an important role in disease diagnosis, treatment decision-making and drug development. Previous works on interpretability for machine learning models on pathology images have revolved around methods such as attention value visualization and deriving human-interpretable features from model heatmaps. Mechanistic interpretability is an emerging area of model interpretability that focuses on reverse-engineering neural networks. Sparse Autoencoders (SAEs) have emerged as a promising direction in terms of extracting monosemantic features from polysemantic model activations. In this work, we trained a Sparse Autoencoder on the embeddings of a pathology pretrained foundation model. We found that Sparse Autoencoder features represent interpretable and monosemantic biological concepts. In particular, individual SAE dimensions showed strong correlations with cell type counts such as plasma cells and lymphocytes. These biological representations were unique to the pathology pretrained model and were not found in a self-supervised model pretrained on natural images. We demonstrated that such biologically-grounded monosemantic representations evolved across the model's depth, and the pathology foundation model eventually gained robustness to non-biological factors such as scanner type. The emergence of biologically relevant SAE features was generalizable to an out-of-domain dataset. Our work paves the way for further exploration around interpretable feature dimensions and their utility for medical and clinical applications.
CVMar 23, 2023
SC-MIL: Supervised Contrastive Multiple Instance Learning for Imbalanced Classification in PathologyDinkar Juyal, Siddhant Shingi, Syed Ashar Javed et al.
Multiple Instance learning (MIL) models have been extensively used in pathology to predict biomarkers and risk-stratify patients from gigapixel-sized images. Machine learning problems in medical imaging often deal with rare diseases, making it important for these models to work in a label-imbalanced setting. In pathology images, there is another level of imbalance, where given a positively labeled Whole Slide Image (WSI), only a fraction of pixels within it contribute to the positive label. This compounds the severity of imbalance and makes imbalanced classification in pathology challenging. Furthermore, these imbalances can occur in out-of-distribution (OOD) datasets when the models are deployed in the real-world. We leverage the idea that decoupling feature and classifier learning can lead to improved decision boundaries for label imbalanced datasets. To this end, we investigate the integration of supervised contrastive learning with multiple instance learning (SC-MIL). Specifically, we propose a joint-training MIL framework in the presence of label imbalance that progressively transitions from learning bag-level representations to optimal classifier learning. We perform experiments with different imbalance settings for two well-studied problems in cancer pathology: subtyping of non-small cell lung cancer and subtyping of renal cell carcinoma. SC-MIL provides large and consistent improvements over other techniques on both in-distribution (ID) and OOD held-out sets across multiple imbalanced settings.
IVApr 11, 2022
Rethinking Machine Learning Model Evaluation in PathologySyed Ashar Javed, Dinkar Juyal, Zahil Shanis et al.
Machine Learning has been applied to pathology images in research and clinical practice with promising outcomes. However, standard ML models often lack the rigorous evaluation required for clinical decisions. Machine learning techniques for natural images are ill-equipped to deal with pathology images that are significantly large and noisy, require expensive labeling, are hard to interpret, and are susceptible to spurious correlations. We propose a set of practical guidelines for ML evaluation in pathology that address the above concerns. The paper includes measures for setting up the evaluation framework, effectively dealing with variability in labels, and a recommended suite of tests to address issues related to domain shift, robustness, and confounding variables. We hope that the proposed framework will bridge the gap between ML researchers and domain experts, leading to wider adoption of ML techniques in pathology and improving patient outcomes.
CVNov 4, 2025
PLUTO-4: Frontier Pathology Foundation ModelsHarshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti et al.
Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including tile classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
IVMay 13, 2024
PLUTO: Pathology-Universal TransformerDinkar Juyal, Harshith Padigela, Chintan Shah et al.
Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images provide a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology FM that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.
CVDec 11, 2019
CineFilter: Unsupervised Filtering for Real Time Autonomous Camera SystemsSudheer Achary, K L Bhanu Moorthy, Syed Ashar Javed et al.
Autonomous camera systems are often subjected to an optimization/filtering operation to smoothen and stabilize the rough trajectory estimates. Most common filtering techniques do reduce the irregularities in data; however, they fail to mimic the behavior of a human cameraman. Global filtering methods modeling human camera operators have been successful; however, they are limited to offline settings. In this paper, we propose two online filtering methods called Cinefilters, which produce smooth camera trajectories that are motivated by cinematographic principles. The first filter (CineConvex) uses a sliding window-based convex optimization formulation, and the second (CineCNN) is a CNN based encoder-decoder model. We evaluate the proposed filters in two different settings, namely a basketball dataset and a stage performance dataset. Our models outperform previous methods and baselines on both quantitative and qualitative metrics. The CineConvex and CineCNN filters operate at about 250fps and 1000fps, respectively, with a minor latency (half a second), making them apt for a variety of real-time applications.
CVOct 2, 2019
Embodied Language Grounding with 3D Visual Feature RepresentationsMihir Prabhudesai, Hsiao-Yu Fish Tung, Syed Ashar Javed et al.
We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image. We present generative models that condition on the dependency tree of an utterance and generate a corresponding visual 3D feature map as well as reason about its plausibility, and detector models that condition on both the dependency tree of an utterance and a related image and localize the object referents in the 3D feature map inferred from the image. Our model outperforms models of language and vision that associate language with 2D CNN activations or 2D images by a large margin in a variety of tasks, such as, classifying plausibility of utterances, detecting referential expressions, and supplying rewards for trajectory optimization of object placement policies from language instructions. We perform numerous ablations and show the improved performance of our detectors is due to its better generalization across camera viewpoints and lack of object interferences in the inferred 3D feature space, and the improved performance of our generators is due to their ability to spatially reason about objects and their configurations in 3D when mapping from language to scenes.
CVMar 17, 2018
MergeNet: A Deep Net Architecture for Small Obstacle DiscoveryKrishnam Gupta, Syed Ashar Javed, Vineet Gandhi et al.
We present here, a novel network architecture called MergeNet for discovering small obstacles for on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less amount of data since the physical setup and the annotation process for small obstacles is hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-art results with just 135 images in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples.
CVMar 17, 2018
Learning Unsupervised Visual Grounding Through Semantic Self-SupervisionSyed Ashar Javed, Shreyas Saxena, Vineet Gandhi
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbate this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and comparable to state-of-art performance on the Flickr30k dataset.
CVMay 11, 2017
Object-Level Context Modeling For Scene Classification with Context-CNNSyed Ashar Javed, Anil Kumar Nelakanti
Convolutional Neural Networks (CNNs) have been used extensively for computer vision tasks and produce rich feature representation for objects or parts of an image. But reasoning about scenes requires integration between the low-level feature representations and the high-level semantic information. We propose a deep network architecture which models the semantic context of scenes by capturing object-level information. We use Long Short Term Memory(LSTM) units in conjunction with object proposals to incorporate object-object relationship and object-scene relationship in an end-to-end trainable manner. We evaluate our model on the LSUN dataset and achieve results comparable to the state-of-art. We further show visualization of the learned features and analyze the model with experiments to verify our model's ability to model context.