Ravi Soni

IV
h-index11
10papers
44citations
Novelty40%
AI Score40

10 Papers

IVAug 8, 2024
Is SAM 2 Better than SAM in Medical Image Segmentation?

Sourya Sengupta, Satrajit Chakrabarty, Ravi Soni

The Segment Anything Model (SAM) has demonstrated impressive performance in zero-shot promptable segmentation on natural images. The recently released Segment Anything Model 2 (SAM 2) claims to outperform SAM on images and extends the model's capabilities to video segmentation. Evaluating the performance of this new model in medical image segmentation, specifically in a zero-shot promptable manner, is crucial. In this work, we conducted extensive studies using multiple datasets from various imaging modalities to compare the performance of SAM and SAM 2. We employed two point-prompt strategies: (i) multiple positive prompts where one prompt is placed near the centroid of the target structure, while the remaining prompts are randomly placed within the structure, and (ii) combined positive and negative prompts where one positive prompt is placed near the centroid of the target structure, and two negative prompts are positioned outside the structure, maximizing the distance from the positive prompt and from each other. The evaluation encompassed 24 unique organ-modality combinations, including abdominal structures, cardiac structures, fetal head images, skin lesions and polyp images across 11 publicly available MRI, CT, ultrasound, dermoscopy, and endoscopy datasets. Preliminary results based on 2D images indicate that while SAM 2 may perform slightly better in a few cases, it does not generally surpass SAM for medical image segmentation. Notably, SAM 2 performs worse than SAM in lower contrast imaging modalities, such as CT and ultrasound. However, for MRI images, SAM 2 performs on par with or better than SAM. Like SAM, SAM 2 also suffers from over-segmentation issues, particularly when the boundaries of the target organ are fuzzy.

CVJul 15, 2022
Adversarial Focal Loss: Asking Your Discriminator for Hard Examples

Chen Liu, Xiaomeng Dong, Michael Potter et al.

Focal Loss has reached incredible popularity as it uses a simple technique to identify and utilize hard examples to achieve better performance on classification. However, this method does not easily generalize outside of classification tasks, such as in keypoint detection. In this paper, we propose a novel adaptation of Focal Loss for keypoint detection tasks, called Adversarial Focal Loss (AFL). AFL not only is semantically analogous to Focal loss, but also works as a plug-and-chug upgrade for arbitrary loss functions. While Focal Loss requires output from a classifier, AFL leverages a separate adversarial network to produce a difficulty score for each input. This difficulty score can then be used to dynamically prioritize learning on hard examples, even in absence of a classifier. In this work, we show AFL's effectiveness in enhancing existing methods in keypoint detection and verify its capability to re-weigh examples based on difficulty.

IVJan 14
POWDR: Pathology-preserving Outpainting with Wavelet Diffusion for 3D MRI

Fei Tan, Ashok Vardhan Addala, Bruno Astuto Arouche Nunes et al.

Medical imaging datasets often suffer from class imbalance and limited availability of pathology-rich cases, which constrains the performance of machine learning models for segmentation, classification, and vision-language tasks. To address this challenge, we propose POWDR, a pathology-preserving outpainting framework for 3D MRI based on a conditioned wavelet diffusion model. Unlike conventional augmentation or unconditional synthesis, POWDR retains real pathological regions while generating anatomically plausible surrounding tissue, enabling diversity without fabricating lesions. Our approach leverages wavelet-domain conditioning to enhance high-frequency detail and mitigate blurring common in latent diffusion models. We introduce a random connected mask training strategy to overcome conditioning-induced collapse and improve diversity outside the lesion. POWDR is evaluated on brain MRI using BraTS datasets and extended to knee MRI to demonstrate tissue-agnostic applicability. Quantitative metrics (FID, SSIM, LPIPS) confirm image realism, while diversity analysis shows significant improvement with random-mask training (cosine similarity reduced from 0.9947 to 0.9580; KL divergence increased from 0.00026 to 0.01494). Clinically relevant assessments reveal gains in tumor segmentation performance using nnU-Net, with Dice scores improving from 0.6992 to 0.7137 when adding 50 synthetic cases. Tissue volume analysis indicates no significant differences for CSF and GM compared to real images. These findings highlight POWDR as a practical solution for addressing data scarcity and class imbalance in medical imaging. The method is extensible to multiple anatomies and offers a controllable framework for generating diverse, pathology-preserving synthetic data to support robust model development.

CVNov 7, 2025
Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective

Xing Yao, Ahana Gangopadhyay, Hsi-Ming Chang et al.

Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.

IVNov 26, 2025
Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data

Satrajit Chakrabarty, Ravi Soni

Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.

IVApr 11, 2025
SynthFM: Training Modality-agnostic Foundation Models for Medical Image Segmentation without Real Medical Data

Sourya Sengupta, Satrajit Chakrabarty, Keerthi Sravan Ravi et al.

Foundation models like the Segment Anything Model (SAM) excel in zero-shot segmentation for natural images but struggle with medical image segmentation due to differences in texture, contrast, and noise. Annotating medical images is costly and requires domain expertise, limiting large-scale annotated data availability. To address this, we propose SynthFM, a synthetic data generation framework that mimics the complexities of medical images, enabling foundation models to adapt without real medical data. Using SAM's pretrained encoder and training the decoder from scratch on SynthFM's dataset, we evaluated our method on 11 anatomical structures across 9 datasets (CT, MRI, and Ultrasound). SynthFM outperformed zero-shot baselines like SAM and MedSAM, achieving superior results under different prompt settings and on out-of-distribution datasets.

IVMar 7, 2025
SAS: Segment Anything Small for Ultrasound -- A Non-Generative Data Augmentation Technique for Robust Deep Learning in Ultrasound Imaging

Danielle L. Ferreira, Ahana Gangopadhyay, Hsi-Ming Chang et al.

Accurate segmentation of anatomical structures in ultrasound (US) images, particularly small ones, is challenging due to noise and variability in imaging conditions (e.g., probe position, patient anatomy, tissue characteristics and pathology). To address this, we introduce Segment Anything Small (SAS), a simple yet effective scale- and texture-aware data augmentation technique designed to enhance the performance of deep learning models for segmenting small anatomical structures in ultrasound images. SAS employs a dual transformation strategy: (1) simulating diverse organ scales by resizing and embedding organ thumbnails into a black background, and (2) injecting noise into regions of interest to simulate varying tissue textures. These transformations generate realistic and diverse training data without introducing hallucinations or artifacts, improving the model's robustness to noise and variability. We fine-tuned a promptable foundation model on a controlled organ-specific medical imaging dataset and evaluated its performance on one internal and five external datasets. Experimental results demonstrate significant improvements in segmentation performance, with Dice score gains of up to 0.35 and an average improvement of 0.16 [95% CI 0.132,0.188]. Additionally, our iterative point prompts provide precise control and adaptive refinement, achieving performance comparable to bounding box prompts with just two points. SAS enhances model robustness and generalizability across diverse anatomical structures and imaging conditions, particularly for small structures, without compromising the accuracy of larger ones. By offering a computationally efficient solution that eliminates the need for extensive human labeling efforts, SAS emerges as a powerful tool for advancing medical image analysis, particularly in resource-constrained settings.

IVFeb 14, 2025
SAMRI-2: A Memory-based Model for Cartilage and Meniscus Segmentation in 3D MRIs of the Knee Joint

Danielle L. Ferreira, Bruno A. A. Nunes, Xuzhe Zhang et al.

Accurate morphometric assessment of cartilage-such as thickness/volume-via MRI is essential for monitoring knee osteoarthritis. Segmenting cartilage remains challenging and dependent on extensive expert-annotated datasets, which are heavily subjected to inter-reader variability. Recent advancements in Visual Foundational Models (VFM), especially memory-based approaches, offer opportunities for improving generalizability and robustness. This study introduces a deep learning (DL) method for cartilage and meniscus segmentation from 3D MRIs using interactive, memory-based VFMs. To improve spatial awareness and convergence, we incorporated a Hybrid Shuffling Strategy (HSS) during training and applied a segmentation mask propagation technique to enhance annotation efficiency. We trained four AI models-a CNN-based 3D-VNet, two automatic transformer-based models (SaMRI2D and SaMRI3D), and a transformer-based promptable memory-based VFM (SAMRI-2)-on 3D knee MRIs from 270 patients using public and internal datasets and evaluated on 57 external cases, including multi-radiologist annotations and different data acquisitions. Model performance was assessed against reference standards using Dice Score (DSC) and Intersection over Union (IoU), with additional morphometric evaluations to further quantify segmentation accuracy. SAMRI-2 model, trained with HSS, outperformed all other models, achieving an average DSC improvement of 5 points, with a peak improvement of 12 points for tibial cartilage. It also demonstrated the lowest cartilage thickness errors, reducing discrepancies by up to threefold. Notably, SAMRI-2 maintained high performance with as few as three user clicks per volume, reducing annotation effort while ensuring anatomical precision. This memory-based VFM with spatial awareness offers a novel approach for reliable AI-assisted knee MRI segmentation, advancing DL in musculoskeletal imaging.

IVNov 10, 2020
Pristine annotations-based multi-modal trained artificial intelligence solution to triage chest X-ray for COVID-19

Tao Tan, Bipul Das, Ravi Soni et al.

The COVID-19 pandemic continues to spread and impact the well-being of the global population. The front-line modalities including computed tomography (CT) and X-ray play an important role for triaging COVID patients. Considering the limited access of resources (both hardware and trained personnel) and decontamination considerations, CT may not be ideal for triaging suspected subjects. Artificial intelligence (AI) assisted X-ray based applications for triaging and monitoring require experienced radiologists to identify COVID patients in a timely manner and to further delineate the disease region boundary are seen as a promising solution. Our proposed solution differs from existing solutions by industry and academic communities, and demonstrates a functional AI model to triage by inferencing using a single x-ray image, while the deep-learning model is trained using both X-ray and CT data. We report on how such a multi-modal training improves the solution compared to X-ray only training. The multi-modal solution increases the AUC (area under the receiver operating characteristic curve) from 0.89 to 0.93 and also positively impacts the Dice coefficient (0.59 to 0.62) for localizing the pathology. To the best our knowledge, it is the first X-ray solution by leveraging multi-modal information for the development.

IRMay 10, 2020
Transformer Based Language Models for Similar Text Retrieval and Ranking

Javed Qadrud-Din, Ashraf Bah Rabiou, Ryan Walker et al.

Most approaches for similar text retrieval and ranking with long natural language queries rely at some level on queries and responses having words in common with each other. Recent applications of transformer-based neural language models to text retrieval and ranking problems have been very promising, but still involve a two-step process in which result candidates are first obtained through bag-of-words-based approaches, and then reranked by a neural transformer. In this paper, we introduce novel approaches for effectively applying neural transformer models to similar text retrieval and ranking without an initial bag-of-words-based step. By eliminating the bag-of-words-based step, our approach is able to accurately retrieve and rank results even when they have no non-stopwords in common with the query. We accomplish this by using bidirectional encoder representations from transformers (BERT) to create vectorized representations of sentence-length texts, along with a vector nearest neighbor search index. We demonstrate both supervised and unsupervised means of using BERT to accomplish this task.