Rajarshi Bhattacharya

CV
h-index: 40
6 papers
35 citations
Novelty: 57%
AI Score: 46

6 Papers

CV · Jun 30, 2023 · Code
Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation

Balamurali Murugesan, Rukhshanda Hussain, Rajarshi Bhattacharya et al.

Recently, CLIP-based approaches have exhibited remarkable performance on generalization and few-shot learning tasks, fueled by the power of contrastive language-vision pre-training. In particular, prompt tuning has emerged as an effective strategy to adapt the pre-trained language-vision models to downstream tasks by employing task-related textual tokens. Motivated by this progress, in this work we question whether other fundamental problems, such as weakly supervised semantic segmentation (WSSS), can benefit from prompt tuning. Our findings reveal two interesting observations that shed light on the impact of prompt tuning on WSSS. First, modifying only the class token of the text prompt results in a greater impact on the Class Activation Map (CAM), compared to arguably more complex strategies that optimize the context. And second, the class token associated with the image ground truth does not necessarily correspond to the category that yields the best CAM. Motivated by these observations, we introduce a novel approach based on a PrOmpt cLass lEarning (POLE) strategy. Through extensive experiments we demonstrate that our simple, yet efficient approach achieves SOTA performance in a well-known WSSS benchmark. These results highlight not only the benefits of language-vision models in WSSS but also the potential of prompt learning for this problem. The code is available at https://github.com/rB080/WSS_POLE.

CV · Apr 16, 2025 · Code
Beyond Patches: Mining Interpretable Part-Prototypes for Explainable AI

Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali et al.

As AI systems grow more capable, it becomes increasingly important that their decisions remain understandable and aligned with human expectations. A key challenge is the limited interpretability of deep models. Post-hoc methods like GradCAM offer heatmaps but provide limited conceptual insight, while prototype-based approaches offer example-based explanations but often rely on rigid region selection and lack semantic consistency. To address these limitations, we propose PCMNet, a part-prototypical concept mining network that learns human-comprehensible prototypes from meaningful image regions without additional supervision. By clustering these prototypes into concept groups and extracting concept activation vectors, PCMNet provides structured, concept-level explanations and enhances robustness to occlusion and challenging conditions, which are both critical for building reliable and aligned AI systems. Experiments across multiple image classification benchmarks show that PCMNet outperforms state-of-the-art methods in interpretability, stability, and robustness. This work contributes to AI alignment by enhancing transparency, controllability, and trustworthiness in AI systems. Our code is available at: https://github.com/alehdaghi/PCMNet.

IV · Feb 1, 2022 · Code
An Embarrassingly Simple Consistency Regularization Method for Semi-Supervised Medical Image Segmentation

Hritam Basak, Rajarshi Bhattacharya, Rukhshanda Hussain et al.

The scarcity of pixel-level annotation is a prevalent problem in medical image segmentation tasks. In this paper, we introduce a novel regularization strategy involving interpolation-based mixing for semi-supervised medical image segmentation. The proposed method is a new consistency regularization strategy that encourages the segmentation of an interpolation of two unlabelled samples to be consistent with the interpolation of the segmentation maps of those samples. This method represents a specific type of data-adaptive regularization paradigm which helps to minimize overfitting to labelled data under high confidence values. The proposed method is advantageous over adversarial and generative models as it requires no additional computation. Upon evaluation on two publicly available MRI datasets, ACDC and MMWHS, experimental results demonstrate the superiority of the proposed method in comparison to existing semi-supervised models. Code is available at: https://github.com/hritam-98/ICT-MedSeg
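The consistency idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single-model setup, the mean-squared-error penalty, the flat-list image representation, and the names `mix`, `consistency_loss`, and `toy_model` are all assumptions made for illustration.

```python
import math


def mix(a, b, lam):
    """Pixel-wise interpolation lam * a + (1 - lam) * b of two equal-length lists."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def consistency_loss(model, u1, u2, lam):
    """Mean squared gap between the prediction on the mixed input
    and the mix of the two individual predictions (assumed penalty)."""
    pred_of_mix = model(mix(u1, u2, lam))
    mix_of_pred = mix(model(u1), model(u2), lam)
    gaps = [(p - q) ** 2 for p, q in zip(pred_of_mix, mix_of_pred)]
    return sum(gaps) / len(gaps)


def toy_model(image):
    """Hypothetical stand-in for a segmentation network: a per-pixel sigmoid."""
    return [1.0 / (1.0 + math.exp(-4.0 * (v - 0.5))) for v in image]


# Two unlabelled "images" flattened to pixel lists; no labels are needed.
u1, u2 = [0.0, 1.0], [1.0, 0.0]
loss = consistency_loss(toy_model, u1, u2, lam=0.25)
```

A model that is linear in its input incurs zero loss under this penalty, so the term only penalizes predictions that behave non-linearly between the two unlabelled samples.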

41.0 · CV · Apr 29
InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

Shakeeb Murtaza, Aryan Shukla, Rajarshi Bhattacharya et al.

Text-to-image person re-identification (TI-ReID) relies on a natural-language text description to retrieve the top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot-attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new open-vocabulary, lightweight supervision, patch-phrase interaction module (PPIM) is proposed to train a standard TI-ReID model with concept-level guidance. Concept-based part phrases provide evidence that encourages the model to attend to corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks like CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics, while sustaining competitive retrieval accuracy. The code is included in the supplementary materials and will be made public.
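The counterfactual region masking mentioned in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's protocol: it uses the drop in text-image cosine similarity as a proxy for retrieval degradation, mean pooling over patch features as the image embedding, and zeroing as the masking operation. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def pool(patches):
    """Mean-pool patch features into one image embedding (assumed pooling)."""
    n = len(patches)
    return [sum(col) / n for col in zip(*patches)]


def mask_top_k(patches, scores, k):
    """Zero out the k patches with the highest explanation scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:k])
    return [[0.0] * len(p) if i in top else list(p) for i, p in enumerate(patches)]


def similarity_drop(patches, scores, text_emb, k):
    """Similarity before masking minus similarity after masking."""
    before = cosine(pool(patches), text_emb)
    after = cosine(pool(mask_top_k(patches, scores, k)), text_emb)
    return before - after


patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # toy patch features
text_emb = [1.0, 0.0]                            # toy phrase embedding
faithful = similarity_drop(patches, [0.9, 0.1, 0.2], text_emb, k=1)
unfaithful = similarity_drop(patches, [0.1, 0.9, 0.2], text_emb, k=1)
```

In this toy setup, an explanation that ranks the truly matching patch first yields a larger drop than one that ranks an irrelevant patch first, which is the behaviour a perturbation-based metric is meant to reward.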

CV · Jan 23, 2025
From Cross-Modal to Mixed-Modal Visible-Infrared Re-Identification

Mahdi Alehdaghi, Rajarshi Bhattacharya, Pourya Shamsolmoali et al.

Visible-infrared person re-identification (VI-ReID) aims to match individuals across different camera modalities, a critical task in modern surveillance systems. While current VI-ReID methods focus on cross-modality matching, real-world applications often involve mixed galleries containing both visible (V) and infrared (I) images, where state-of-the-art methods show significant performance limitations due to large domain shifts and low discrimination across mixed modalities. This is because gallery images from the same modality may have lower domain gaps but correspond to different identities. This paper introduces a novel mixed-modal ReID setting, where galleries contain data from both modalities. To address the domain shift in inter-modal matching and the low discrimination capacity in intra-modal matching, we propose the Mixed Modality-Erased and -Related (MixER) method. The MixER learning approach disentangles modality-specific and modality-shared identity information through orthogonal decomposition, modality-confusion, and ID-modality-related objectives. MixER enhances feature robustness across modalities, improving performance in both cross-modal and mixed-modal settings. Our extensive experiments on the SYSU-MM01, RegDB and LLMC datasets indicate that our approach can provide state-of-the-art results using a single backbone, and showcase the flexibility of our approach in mixed gallery applications.
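One common way to encourage the kind of orthogonal decomposition this abstract describes is to penalize the squared cosine similarity between the two feature parts. The sketch below assumes that form; the abstract does not specify MixER's actual objective, and the name `orthogonality_penalty` and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonality_penalty(shared, specific):
    """Squared cosine similarity between modality-shared and modality-specific
    features: 0.0 when they are orthogonal, 1.0 when they are collinear."""
    denom = math.sqrt(dot(shared, shared) * dot(specific, specific))
    if denom == 0.0:
        return 0.0
    return (dot(shared, specific) / denom) ** 2


disentangled = orthogonality_penalty([1.0, 0.0], [0.0, 3.0])
entangled = orthogonality_penalty([1.0, 1.0], [2.0, 2.0])
```

Minimizing such a term pushes the two parts of the representation toward carrying non-overlapping information, which is the intuition behind separating identity cues shared across modalities from modality-specific ones.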

CV · May 23, 2025
DART$^3$: Leveraging Distance for Test Time Adaptation in Person Re-Identification

Rajarshi Bhattacharya, Shakeeb Murtaza, Christian Desrosiers et al.

Person re-identification (ReID) models are known to suffer from camera bias, where learned representations cluster according to camera viewpoints rather than identity, leading to significant performance degradation under (inter-camera) domain shifts in real-world surveillance systems when new cameras are added to camera networks. State-of-the-art test-time adaptation (TTA) methods, largely designed for classification tasks, rely on classification entropy-based objectives that fail to generalize well to ReID, thus making them unsuitable for tackling camera bias. In this paper, we introduce DART$^3$, a TTA framework specifically designed to mitigate camera-induced domain shifts in person ReID. DART$^3$ (Distance-Aware Retrieval Tuning at Test Time) leverages a distance-based objective that aligns better with image retrieval tasks like ReID by exploiting the correlation between nearest-neighbor distance and prediction error. Unlike prior ReID-specific domain adaptation methods, DART$^3$ requires no source data, architectural modifications, or retraining, and can be deployed in both fully black-box and hybrid settings. Empirical evaluations on multiple ReID benchmarks indicate that DART$^3$ and DART$^3$ LITE, a lightweight alternative to the approach, consistently outperform state-of-the-art TTA baselines, making for a viable option to online learning to mitigate the adverse effects of camera bias.
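The quantity this abstract builds on, the nearest-neighbor distance between a query and the gallery, can be computed as in the following sketch. It only illustrates the signal itself, not the DART$^3$ objective or its adaptation procedure; the Euclidean metric, the toy embeddings, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_distance(query, gallery):
    """Distance from a query embedding to its closest gallery embedding."""
    return min(euclidean(query, g) for g in gallery)


def mean_nn_distance(queries, gallery):
    """Average nearest-neighbor distance over a batch of queries."""
    total = sum(nearest_neighbor_distance(q, gallery) for q in queries)
    return total / len(queries)


gallery = [[0.0, 0.0], [1.0, 0.0]]        # toy gallery embeddings
in_domain = [[0.1, 0.0]]                  # query close to a gallery entry
shifted = [[0.5, 2.0]]                    # query displaced, e.g. by a new camera
near = mean_nn_distance(in_domain, gallery)
far = mean_nn_distance(shifted, gallery)
```

Because this value needs no identity labels, it can be evaluated at test time on incoming queries, which is what makes a distance-based signal usable for retrieval where classification entropy is not available.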