Rohit Kundu

CV
h-index63
13papers
293citations
Novelty55%
AI Score47

13 Papers

IVMar 27, 2022Code
MFSNet: A Multi Focus Segmentation Network for Skin Lesion Segmentation

Hritam Basak, Rohit Kundu, Ram Sarkar

Segmentation is essential for medical image analysis to identify and localize diseases, monitor morphological changes, and extract discriminative features for further diagnosis. Skin cancer is one of the most common types of cancer globally, and its early diagnosis is pivotal for the complete elimination of malignant tumors from the body. This research develops an Artificial Intelligence (AI) framework for supervised skin lesion segmentation employing the deep learning approach. The proposed framework, called MFSNet (Multi-Focus Segmentation Network), uses differently scaled feature maps for computing the final segmentation mask using raw input RGB images of skin lesions. In doing so, initially, the images are preprocessed to remove unwanted artifacts and noises. The MFSNet employs the Res2Net backbone, a recently proposed convolutional neural network (CNN), for obtaining deep features used in a Parallel Partial Decoder (PPD) module to get a global map of the segmentation mask. In different stages of the network, convolution features and multi-scale maps are used in two boundary attention (BA) modules and two reverse attention (RA) modules to generate the final segmentation output. MFSNet, when evaluated on three publicly available datasets: $PH^2$, ISIC 2017, and HAM10000, outperforms state-of-the-art methods, justifying the reliability of the framework. The relevant codes for the proposed approach are accessible at https://github.com/Rohit-Kundu/MFSNet

CVOct 26, 2022Code
IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation

Hritam Basak, Soumitri Chattopadhyay, Rohit Kundu et al.

Due to the scarcity of labeled data, Contrastive Self-Supervised Learning (SSL) frameworks have lately shown great potential in several medical image analysis tasks. However, the existing contrastive mechanisms are sub-optimal for dense pixel-level segmentation tasks due to their inability to mine local features. To this end, we extend the concept of metric learning to the segmentation task, using a dense (dis)similarity learning for pre-training a deep encoder network, and employing a semi-supervised paradigm to fine-tune for the downstream task. Specifically, we propose a simple convolutional projection head for obtaining dense pixel-level features, and a new contrastive loss to utilize these dense projections thereby improving the local representations. A bidirectional consistency regularization mechanism involving two-stream model training is devised for the downstream task. Upon comparison, our IDEAL method outperforms the SoTA methods by fair margins on cardiac MRI segmentation. Code available: https://github.com/hritam-98/IDEAL-ICASSP23

CVMar 28, 2022
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches

Ayan Kumar Bhunia, Viswanatha Reddy Gajjala, Subhadeep Koley et al.

The human visual system is remarkable in learning new visual concepts from just a few examples. This is precisely the goal behind few-shot class incremental learning (FSCIL), where the emphasis is additionally placed on ensuring the model does not suffer from "forgetting". In this paper, we push the boundary further for FSCIL by addressing two key questions that bottleneck its ubiquitous application (i) can the model learn from diverse modalities other than just photo (as humans do), and (ii) what if photos are not readily accessible (due to ethical and privacy constraints). Our key innovation lies in advocating the use of sketches as a new modality for class support. The product is a "Doodle It Yourself" (DIY) FSCIL framework where the users can freely sketch a few examples of a novel class for the model to learn to recognize photos of that class. For that, we present a framework that infuses (i) gradient consensus for domain invariant learning, (ii) knowledge distillation for preserving old class information, and (iii) graph attention networks for message passing between old and novel classes. We experimentally show that sketches are better class support than text in the context of FSCIL, echoing findings elsewhere in the sketching literature.

CVSep 2, 2024
EarthGen: Generating the World from Top-Down Views

Ansh Sharma, Albert Xiao, Praneet Rathi et al.

In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.

CVJun 9, 2021Code
Cervical Cytology Classification Using PCA & GWO Enhanced Deep Features Selection

Hritam Basak, Rohit Kundu, Sukanta Chakraborty et al.

Cervical cancer is one of the most deadly and common diseases among women worldwide. It is completely curable if diagnosed in an early stage, but the tedious and costly detection procedure makes it unviable to conduct population-wise screening. Thus, to augment the effort of the clinicians, in this paper, we propose a fully automated framework that utilizes Deep Learning and feature selection using evolutionary optimization for cytology image classification. The proposed framework extracts Deep feature from several Convolution Neural Network models and uses a two-step feature reduction approach to ensure reduction in computation cost and faster convergence. The features extracted from the CNN models form a large feature space whose dimensionality is reduced using Principal Component Analysis while preserving 99% of the variance. A non-redundant, optimal feature subset is selected from this feature space using an evolutionary optimization algorithm, the Grey Wolf Optimizer, thus improving the classification performance. Finally, the selected feature subset is used to train an SVM classifier for generating the final predictions. The proposed framework is evaluated on three publicly available benchmark datasets: Mendeley Liquid Based Cytology (4-class) dataset, Herlev Pap Smear (7-class) dataset, and the SIPaKMeD Pap Smear (5-class) dataset achieving classification accuracies of 99.47%, 98.32% and 97.87% respectively, thus justifying the reliability of the approach. The relevant codes for the proposed approach can be found in: https://github.com/DVLP-CMATERJU/Two-Step-Feature-Enhancement

CVDec 16, 2024
Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Rohit Kundu, Hao Xiong, Vishal Mohanty et al.

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the \underline{U}niversal \underline{N}etwork for \underline{I}dentifying \underline{T}ampered and synth\underline{E}tic videos (\texttt{UNITE}) model, which, unlike traditional detectors, captures full-frame manipulations. \texttt{UNITE} extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that \texttt{UNITE} outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.

CVOct 18, 2024
Multi-modal Pose Diffuser: A Multimodal Generative Conditional Pose Prior

Calvin-Khang Ta, Arindam Dutta, Rohit Kundu et al.

The Skinned Multi-Person Linear (SMPL) model plays a crucial role in 3D human pose estimation, providing a streamlined yet effective representation of the human body. However, ensuring the validity of SMPL configurations during tasks such as human mesh regression remains a significant challenge , highlighting the necessity for a robust human pose prior capable of discerning realistic human poses. To address this, we introduce MOPED: \underline{M}ulti-m\underline{O}dal \underline{P}os\underline{E} \underline{D}iffuser. MOPED is the first method to leverage a novel multi-modal conditional diffusion model as a prior for SMPL pose parameters. Our method offers powerful unconditional pose generation with the ability to condition on multi-modal inputs such as images and text. This capability enhances the applicability of our approach by incorporating additional context often overlooked in traditional pose priors. Extensive experiments across three distinct tasks-pose estimation, pose denoising, and pose completion-demonstrate that our multi-modal diffusion model-based prior significantly outperforms existing methods. These results indicate that our model captures a broader spectrum of plausible human poses.

LGAug 20, 2025
Towards Source-Free Machine Unlearning

Sk Miraj Ahmed, Umit Yigit Basaran, Dripta S. Raychaudhuri et al.

As machine learning becomes more pervasive and data privacy regulations evolve, the ability to remove private or copyrighted information from trained models is becoming an increasingly critical requirement. Existing unlearning methods often rely on the assumption of having access to the entire training dataset during the forgetting process. However, this assumption may not hold true in practical scenarios where the original training data may not be accessible, i.e., the source-free setting. To address this challenge, we focus on the source-free unlearning scenario, where an unlearning algorithm must be capable of removing specific data from a trained model without requiring access to the original training dataset. Building on recent work, we present a method that can estimate the Hessian of the unknown remaining training data, a crucial component required for efficient unlearning. Leveraging this estimation technique, our method enables efficient zero-shot unlearning while providing robust theoretical guarantees on the unlearning performance, while maintaining performance on the remaining data. Extensive experiments over a wide range of datasets verify the efficacy of our method.

CVNov 16, 2025
SAGA: Source Attribution of Generative AI Videos

Rohit Kundu, Vishal Mohanty, Hao Xiong et al.

The proliferation of generative AI has led to hyper-realistic synthetic videos, escalating misuse risks and outstripping binary real/fake detectors. We introduce SAGA (Source Attribution of Generative AI videos), the first comprehensive framework to address the urgent need for AI-generated video source attribution at a large scale. Unlike traditional detection, SAGA identifies the specific generative model used. It uniquely provides multi-granular attribution across five levels: authenticity, generation task (e.g., T2V/I2V), model version, development team, and the precise generator, offering far richer forensic insights. Our novel video transformer architecture, leveraging features from a robust vision foundation model, effectively captures spatio-temporal artifacts. Critically, we introduce a data-efficient pretrain-and-attribute strategy, enabling SAGA to achieve state-of-the-art attribution using only 0.5\% of source-labeled data per class, matching fully supervised performance. Furthermore, we propose Temporal Attention Signatures (T-Sigs), a novel interpretability method that visualizes learned temporal differences, offering the first explanation for why different video generators are distinguishable. Extensive experiments on public datasets, including cross-domain scenarios, demonstrate that SAGA sets a new benchmark for synthetic video provenance, providing crucial, interpretable insights for forensic and regulatory applications.

CVMar 20, 2025
TruthLens: Visual Grounding for Universal DeepFake Reasoning

Rohit Kundu, Shan Jia, Vishal Mohanty et al.

Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, while existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel, unified, and highly generalizable framework that goes beyond traditional binary classification, providing detailed, textual reasoning for its predictions. Distinct from conventional methods, TruthLens performs MLLM grounding. TruthLens uses a task-driven representation integration strategy that unites global semantic context from a multimodal large language model (MLLM) with region-specific forensic cues through explicit cross-modal adaptation of a vision-only model. This enables nuanced, region-grounded reasoning for both face-manipulated and fully synthetic content, and supports fine-grained queries such as "Does the eyes/nose/mouth look real or fake?"- capabilities beyond pretrained MLLMs alone. Extensive experiments across diverse datasets demonstrate that TruthLens sets a new benchmark in both forensic interpretability and detection accuracy, generalizing to seen and unseen manipulations alike. By unifying high-level scene understanding with fine-grained region grounding, TruthLens delivers transparent DeepFake forensics, bridging a critical gap in the literature.

CVDec 5, 2023
Repurposing SAM for User-Defined Semantics Aware Segmentation

Rohit Kundu, Sudipta Paul, Arindam Dutta et al.

The Segment Anything Model (SAM) excels at generating precise object masks from input prompts but lacks semantic awareness, failing to associate its generated masks with specific object categories. To address this limitation, we propose U-SAM, a novel framework that imbibes semantic awareness into SAM, enabling it to generate targeted masks for user-specified object categories. Given only object class names as input from the user, U-SAM provides pixel-level semantic annotations for images without requiring any labeled/unlabeled samples from the test data distribution. Our approach leverages synthetically generated or web crawled images to accumulate semantic information about the desired object classes. We then learn a mapping function between SAM's mask embeddings and object class labels, effectively enhancing SAM with granularity-specific semantic recognition capabilities. As a result, users can obtain meaningful and targeted segmentation masks for specific objects they request, rather than generic and unlabeled masks. We evaluate U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements of +17.95% and +5.20%, respectively, over state-of-the-art methods. By transforming SAM into a semantically aware segmentation model, U-SAM offers a practical and flexible solution for pixel-level annotation across diverse and unseen domains in a resource-constrained environment.

IVSep 17, 2021
Segmentation of Brain MRI using an Altruistic Harris Hawks' Optimization algorithm

Rajarshi Bandyopadhyay, Rohit Kundu, Diego Oliva et al.

Segmentation is an essential requirement in medicine when digital images are used in illness diagnosis, especially, in posterior tasks as analysis and disease identification. An efficient segmentation of brain Magnetic Resonance Images (MRIs) is of prime concern to radiologists due to their poor illumination and other conditions related to de acquisition of the images. Thresholding is a popular method for segmentation that uses the histogram of an image to label different homogeneous groups of pixels into different classes. However, the computational cost increases exponentially according to the number of thresholds. In this paper, we perform the multi-level thresholding using an evolutionary metaheuristic. It is an improved version of the Harris Hawks Optimization (HHO) algorithm that combines the chaotic initialization and the concept of altruism. Further, for fitness assignment, we use a hybrid objective function where along with the cross-entropy minimization, we apply a new entropy function, and leverage weights to the two objective functions to form a new hybrid approach. The HHO was originally designed to solve numerical optimization problems. Earlier, the statistical results and comparisons have demonstrated that the HHO provides very promising results compared with well-established metaheuristic techniques. In this article, the altruism has been incorporated into the HHO algorithm to enhance its exploitation capabilities. We evaluate the proposed method over 10 benchmark images from the WBA database of the Harvard Medical School and 8 benchmark images from the Brainweb dataset using some standard evaluation metrics.

CVAug 21, 2021
Ensemble of CNN classifiers using Sugeno Fuzzy Integral Technique for Cervical Cytology Image Classification

Rohit Kundu, Hritam Basak, Akhil Koilada et al.

Cervical cancer is the fourth most common category of cancer, affecting more than 500,000 women annually, owing to the slow detection procedure. Early diagnosis can help in treating and even curing cancer, but the tedious, time-consuming testing process makes it impossible to conduct population-wise screening. To aid the pathologists in efficient and reliable detection, in this paper, we propose a fully automated computer-aided diagnosis tool for classifying single-cell and slide images of cervical cancer. The main concern in developing an automatic detection tool for biomedical image classification is the low availability of publicly accessible data. Ensemble Learning is a popular approach for image classification, but simplistic approaches that leverage pre-determined weights to classifiers fail to perform satisfactorily. In this research, we use the Sugeno Fuzzy Integral to ensemble the decision scores from three popular pretrained deep learning models, namely, Inception v3, DenseNet-161 and ResNet-34. The proposed Fuzzy fusion is capable of taking into consideration the confidence scores of the classifiers for each sample, and thus adaptively changing the importance given to each classifier, capturing the complementary information supplied by each, thus leading to superior classification performance. We evaluated the proposed method on three publicly available datasets, the Mendeley Liquid Based Cytology (LBC) dataset, the SIPaKMeD Whole Slide Image (WSI) dataset, and the SIPaKMeD Single Cell Image (SCI) dataset, and the results thus yielded are promising. Analysis of the approach using GradCAM-based visual representations and statistical tests, and comparison of the method with existing and baseline models in literature justify the efficacy of the approach.