CVNov 8, 2025Code
How Reasoning Influences Intersectional Biases in Vision Language ModelsAdit Desai, Sudipta Roy, Mohna Chakraborty
Vision Language Models (VLMs) are increasingly deployed across downstream tasks, yet their training data often encode social biases that surface in outputs. Unlike humans, who interpret images through contextual and social cues, VLMs process them through statistical associations, often leading to reasoning that diverges from human reasoning. By analyzing how a VLM reasons, we can understand how inherent biases are perpetuated and can adversely affect downstream performance. To examine this gap, we systematically analyze social biases in five open-source VLMs for an occupation prediction task, on the FairFace dataset. Across 32 occupations and three different prompting styles, we elicit both predictions and reasoning. Our findings reveal that the biased reasoning patterns systematically underlie intersectional disparities, highlighting the need to align VLM reasoning with human values prior to its downstream deployment.
CVFeb 19
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMsBehzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy et al.
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced-high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose \texttt{\textbf{LATA}} (Laplacian-Assisted Transductive Adaptation), a \textit{training- and label-free} refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a \textit{failure-aware} conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. \texttt{\textbf{LATA}} is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across \textbf{three} medical VLMs and \textbf{nine} downstream tasks, \texttt{\textbf{LATA}} consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that \texttt{\textbf{LATA}} sharpens zero-shot predictions without compromising exchangeability.
CVDec 2, 2024Code
Multimodal Fusion Learning with Dual Attention for Medical ImagingJoy Dhar, Nayyar Zaidi, Maryam Haghighat et al.
Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations. First, they often lack generalizability to other diagnosis tasks due to their focus on a particular disease. Second, they do not fully leverage multiple health records from diverse modalities to learn robust complementary information. And finally, they typically rely on a single attention mechanism, missing the benefits of multiple attention strategies within and across various modalities. To address these issues, this paper proposes a dual robust information fusion attention mechanism (DRIFA) that leverages two attention modules, i.e. multi-branch fusion attention module and the multimodal information fusion attention module. DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework denoted as DRIFA-Net. We show that the multi-branch fusion attention of DRIFA learns enhanced representations for each modality, such as dermoscopy, pap smear, MRI, and CT-scan, whereas multimodal information fusion attention module learns more refined multimodal shared representations, improving the network's generalization across multiple tasks and enhancing overall performance. Additionally, to estimate the uncertainty of DRIFA-Net predictions, we have employed an ensemble Monte Carlo dropout strategy. Extensive experiments on five publicly available datasets with diverse modalities demonstrate that our approach consistently outperforms state-of-the-art methods. The code is available at https://github.com/misti1203/DRIFA-Net.
CVMar 21, 2025Code
Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in HistopathologyDevavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra et al.
In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method's superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at https://github.com/LTS5/SLIP.
CVFeb 25, 2019Code
Bengali Handwritten Character Classification using Transfer Learning on Deep Convolutional Neural NetworkSwagato Chatterjee, Rwik Kumar Dutta, Debayan Ganguly et al.
In this paper, we propose a solution which uses state-of-the-art techniques in Deep Learning to tackle the problem of Bengali Handwritten Character Recognition ( HCR ). Our method uses lesser iterations to train than most other comparable methods. We employ Transfer Learning on ResNet 50, a state-of-the-art deep Convolutional Neural Network Model, pretrained on ImageNet dataset. We also use other techniques like a modified version of One Cycle Policy, varying the input image sizes etc. to ensure that our training occurs fast. We use the BanglaLekha-Isolated Dataset for evaluation of our technique which consists of 84 classes (50 Basic, 10 Numerals and 24 Compound Characters). We are able to achieve 96.12% accuracy in just 47 epochs on BanglaLekha-Isolated dataset. When comparing our method with that of other researchers, considering number of classes and without using Ensemble Learning, the proposed solution achieves state of the art result for Handwritten Bengali Character Recognition. Code and weight files are available at https://github.com/swagato-c/bangla-hwcr-present.
CYAug 15, 2024
Advance Real-time Detection of Traffic Incidents in Highways using Vehicle Trajectory DataSudipta Roy, Samiul Hasan
A significant number of traffic crashes are secondary crashes that occur because of an earlier incident on the road. Thus, early detection of traffic incidents is crucial for road users from safety perspectives with a potential to reduce the risk of secondary crashes. The wide availability of GPS devices now-a-days gives an opportunity of tracking and recording vehicle trajectories. The objective of this study is to use vehicle trajectory data for advance real-time detection of traffic incidents on highways using machine learning-based algorithms. The study uses three days of unevenly sequenced vehicle trajectory data and traffic incident data on I-10, one of the most crash-prone highways in Louisiana. Vehicle trajectories are converted to trajectories based on virtual detector locations to maintain spatial uniformity as well as to generate historical traffic data for machine learning algorithms. Trips matched with traffic incidents on the way are separated and along with other trips with similar spatial attributes are used to build a database for modeling. Multiple machine learning algorithms such as Logistic Regression, Random Forest, Extreme Gradient Boost, and Artificial Neural Network models are used to detect a trajectory that is likely to face an incident in the downstream road section. Results suggest that the Random Forest model achieves the best performance for predicting an incident with reasonable recall value and discrimination capability.
CVAug 11, 2025
MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt CustomizationAnkan Deria, Dwarikanath Mahapatra, Behzad Bozorgtabar et al.
Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape-resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we proposed three key modules: the Garment Representation Module (GRM) for capturing both garment semantics, the Person Representation Module (PRM) for encoding identity and pose cues, and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.
CVJun 18, 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM CaptioningAnkan Deria, Adinath Madhavrao Dukre, Feilong Tang et al.
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
CVDec 16, 2013
A Review on Automated Brain Tumor Detection and Segmentation from MRI of BrainSudipta Roy, Sanjay Nag, Indra Kanta Maitra et al.
Tumor segmentation from magnetic resonance imaging (MRI) data is an important but time consuming manual task performed by medical experts. Automating this process is a challenging task because of the high diversity in the appearance of tumor tissues among different patients and in many cases similarity with the normal tissues. MRI is an advanced medical imaging technique providing rich information about the human soft-tissue anatomy. There are different brain tumor detection and segmentation methods to detect and segment a brain tumor from MRI images. These detection and segmentation approaches are reviewed with an importance placed on enlightening the advantages and drawbacks of these methods for brain tumor detection and segmentation. The use of MRI image detection and segmentation in different procedures are also described. Here a brief review of different segmentation for detection of brain tumor from MRI of brain has been discussed.
CVOct 20, 2012
Identifications of concealed weapon in a Human BodyProf. Samir K. Bandyopadhyay, Biswajita Datta, Sudipta Roy
The detection of weapons concealed underneath a person cloths is very much important to the improvement of the security of the public as well as the safety of public assets like airports, buildings and railway stations etc.
CVJan 25, 2012
A New Local Adaptive Thresholding Technique in BinarizationT. Romen Singh, Sudipta Roy, O. Imocha Singh et al.
Image binarization is the process of separation of pixel values into two groups, white as background and black as foreground. Thresholding plays a major in binarization of images. Thresholding can be categorized into global thresholding and local thresholding. In images with uniform contrast distribution of background and foreground like document images, global thresholding is more appropriate. In degraded document images, where considerable background noise or variation in contrast and illumination exists, there exists many pixels that cannot be easily classified as foreground or background. In such cases, binarization with local thresholding is more appropriate. This paper describes a locally adaptive thresholding technique that removes background by using local mean and mean deviation. Normally the local mean computational time depends on the window size. Our technique uses integral sum image as a prior processing to calculate local mean. It does not involve calculations of standard deviations as in other local adaptive techniques. This along with the fact that calculations of mean is independent of window size speed up the process as compared to other local thresholding techniques.