Sudipta Paul

CV
h-index63
13papers
163citations
Novelty49%
AI Score37

13 Papers

CVOct 14, 2022
AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian

Recent years have seen embodied visual navigation advance in two distinct directions: (i) in equipping the AI agent to follow natural language instructions, and (ii) in making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal, but also often complex, and thus in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN~ -- an interactive agent for Audio-Visual-Language Embodied Navigation. Similar to audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event via navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies to choose either audio-cues for navigation or to query the oracle, and (b) lower-level policies to select navigation actions based on its audio-visual and language inputs. The policies are trained via rewarding for the success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds.

CVJun 6, 2023
CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Xiulong Liu, Sudipta Paul, Moitreya Chatterjee et al.

Audio-visual navigation of an agent towards locating an audio goal is a challenging task especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpret free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large scale dataset: AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that only use uni-directional interaction.

IVMar 8, 2024
C2P-GCN: Cell-to-Patch Graph Convolutional Network for Colorectal Cancer Grading

Sudipta Paul, Bulent Yener, Amanda W. Lund

Graph-based learning approaches, due to their ability to encode tissue/organ structure information, are increasingly favored for grading colorectal cancer histology images. Recent graph-based techniques involve dividing whole slide images (WSIs) into smaller or medium-sized patches, and then building graphs on each patch for direct use in training. This method, however, fails to capture the tissue structure information present in an entire WSI and relies on training from a significantly large dataset of image patches. In this paper, we propose a novel cell-to-patch graph convolutional network (C2P-GCN), which is a two-stage graph formation-based approach. In the first stage, it forms a patch-level graph based on the cell organization on each patch of a WSI. In the second stage, it forms an image-level graph based on a similarity measure between patches of a WSI considering each patch as a node of a graph. This graph representation is then fed into a multi-layer GCN-based classification network. Our approach, through its dual-phase graph construction, effectively gathers local structural details from individual patches and establishes a meaningful connection among all patches across a WSI. As C2P-GCN integrates the structural data of an entire WSI into a single graph, it allows our model to work with significantly fewer training data compared to the latest models for colorectal cancer. Experimental validation of C2P-GCN on two distinct colorectal cancer datasets demonstrates the effectiveness of our method.

CVNov 18, 2025
LINGUAL: Language-INtegrated GUidance in Active Learning for Medical Image Segmentation

Md Shazid Islam, Shreyangshu Bera, Sudipta Paul et al.

Although active learning (AL) in segmentation tasks enables experts to annotate selected regions of interest (ROIs) instead of entire images, it remains highly challenging, labor-intensive, and cognitively demanding due to the blurry and ambiguous boundaries commonly observed in medical images. Also, in conventional AL, annotation effort is a function of the ROI- larger regions make the task cognitively easier but incur higher annotation costs, whereas smaller regions demand finer precision and more attention from the expert. In this context, language guidance provides an effective alternative, requiring minimal expert effort while bypassing the cognitively demanding task of precise boundary delineation in segmentation. Towards this goal, we introduce LINGUAL: a framework that receives natural language instructions from an expert, translates them into executable programs through in-context learning, and automatically performs the corresponding sequence of sub-tasks without any human intervention. We demonstrate the effectiveness of LINGUAL in active domain adaptation (ADA) achieving comparable or superior performance to AL baselines while reducing estimated annotation time by approximately 80%.

CVDec 5, 2023
Repurposing SAM for User-Defined Semantics Aware Segmentation

Rohit Kundu, Sudipta Paul, Arindam Dutta et al.

The Segment Anything Model (SAM) excels at generating precise object masks from input prompts but lacks semantic awareness, failing to associate its generated masks with specific object categories. To address this limitation, we propose U-SAM, a novel framework that imbibes semantic awareness into SAM, enabling it to generate targeted masks for user-specified object categories. Given only object class names as input from the user, U-SAM provides pixel-level semantic annotations for images without requiring any labeled/unlabeled samples from the test data distribution. Our approach leverages synthetically generated or web crawled images to accumulate semantic information about the desired object classes. We then learn a mapping function between SAM's mask embeddings and object class labels, effectively enhancing SAM with granularity-specific semantic recognition capabilities. As a result, users can obtain meaningful and targeted segmentation masks for specific objects they request, rather than generic and unlabeled masks. We evaluate U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements of +17.95% and +5.20%, respectively, over state-of-the-art methods. By transforming SAM into a semantically aware segmentation model, U-SAM offers a practical and flexible solution for pixel-level annotation across diverse and unseen domains in a resource-constrained environment.

CVOct 18, 2020
Exploiting Context for Robustness to Label Noise in Active Learning

Sudipta Paul, Shivkumar Chandrasekaran, B. S. Manjunath et al.

Several works in computer vision have demonstrated the effectiveness of active learning for adapting the recognition model when new unlabeled data becomes available. Most of these works consider that labels obtained from the annotator are correct. However, in a practical scenario, as the quality of the labels depends on the annotator, some of the labels might be wrong, which results in degraded recognition performance. In this paper, we address the problems of i) how a system can identify which of the queried labels are wrong and ii) how a multi-class active learning system can be adapted to minimize the negative impact of label noise. Towards solving the problems, we propose a noisy label filtering based learning approach where the inter-relationship (context) that is quite common in natural data is utilized to detect the wrong labels. We construct a graphical representation of the unlabeled data to encode these relationships and obtain new beliefs on the graph when noisy labels are available. Comparing the new beliefs with the prior relational information, we generate a dissimilarity score to detect the incorrect labels and update the recognition model with correct labels which result in better recognition performance. This is demonstrated in three different applications: scene classification, activity classification, and document classification.

LGSep 13, 2020
FLaPS: Federated Learning and Privately Scaling

Sudipta Paul, Poushali Sengupta, Subhankar Mishra

Federated learning (FL) is a distributed learning process where the model (weights and checkpoints) is transferred to the devices that posses data rather than the classical way of transferring and aggregating the data centrally. In this way, sensitive data does not leave the user devices. FL uses the FedAvg algorithm, which is trained in the iterative model averaging way, on the non-iid and unbalanced distributed data, without depending on the data quantity. Some issues with the FL are, 1) no scalability, as the model is iteratively trained over all the devices, which amplifies with device drops; 2) security and privacy trade-off of the learning process still not robust enough and 3) overall communication efficiency and the cost are higher. To mitigate these challenges we present Federated Learning and Privately Scaling (FLaPS) architecture, which improves scalability as well as the security and privacy of the system. The devices are grouped into clusters which further gives better privacy scaled turn around time to finish a round of training. Therefore, even if a device gets dropped in the middle of training, the whole process can be started again after a definite amount of time. The data and model both are communicated using differentially private reports with iterative shuffling which provides a better privacy-utility trade-off. We evaluated FLaPS on MNIST, CIFAR10, and TINY-IMAGENET-200 dataset using various CNN models. Experimental results prove FLaPS to be an improved, time and privacy scaled environment having better and comparable after-learning-parameters with respect to the central and FL models.

CVAug 20, 2020
Text-based Localization of Moments in a Video Corpus

Sudipta Paul, Niluthpol Chowdhury Mithun, Amit K. Roy-Chowdhury

Prior works on text-based video moment localization focus on temporally grounding the textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment on that relevant video only. Different from such works, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge as the system is required to perform: (i) retrieval of the relevant video where only a segment of the video corresponds with the queried sentence, and (ii) temporal localization of moment in the relevant video based on sentence query. Towards overcoming this challenge, we propose Hierarchical Moment Alignment Network (HMAN) which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments, HMAN focuses on distinguishing inter-video global semantic concepts based on sentence queries. Qualitative and quantitative results on three benchmark text-based video moment retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions - demonstrate that our method achieves promising performance on the proposed task of temporal localization of moments in a corpus of videos.

CRAug 13, 2020
LAC : LSTM AUTOENCODER with Community for Insider Threat Detection

Sudipta Paul, Subhankar Mishra

The employees of any organization, institute, or industry, spend a significant amount of time on a computer network, where they develop their own routine of activities in the form of network transactions over a time period. Insider threat detection involves identifying deviations in the routines or anomalies which may cause harm to the organization in the form of data leaks and secrets sharing. If not automated, this process involves feature engineering for modeling human behavior which is a tedious and time-consuming task. Anomalies in human behavior are forwarded to a human analyst for final threat classification. We developed an unsupervised deep neural network model using LSTM AUTOENCODER which learns to mimic the behavior of individual employees from their day-wise time-stamped sequence of activities. It predicts the threat scenario via significant loss from anomalous routine. Employees in a community tend to align their routine with each other rather than the employees outside their communities, this motivates us to explore a variation of the AUTOENCODER, LSTM AUTOENCODER- trained on the interleaved sequences of activities in the Community (LAC). We evaluate the model on the CERT v6.2 dataset and perform analysis on the loss for normal and anomalous routine across 4000 employees. The aim of our paper is to detect the anomalous employees as well as to explore how the surrounding employees are affecting that employees' routine over time.

CVJul 19, 2020
Connecting the Dots: Detecting Adversarial Perturbations Using Context Inconsistency

Shasha Li, Shitong Zhu, Sudipta Paul et al.

There has been a recent surge in research on adversarial perturbations that defeat Deep Neural Networks (DNNs) in machine vision; most of these perturbation-based attacks target object classifiers. Inspired by the observation that humans are able to recognize objects that appear out of place in a scene or along with other unlikely objects, we augment the DNN with a system that learns context consistency rules during training and checks for the violations of the same during testing. Our approach builds a set of auto-encoders, one for each object class, appropriately trained so as to output a discrepancy between the input and output if an added adversarial perturbation violates context consistency rules. Experiments on PASCAL VOC and MS COCO show that our method effectively detects various adversarial attacks and achieves high ROC-AUC (over 0.95 in most cases); this corresponds to over 20% improvement over a state-of-the-art context-agnostic method.

CRJun 10, 2020
Learning With Differential Privacy

Poushali Sengupta, Sudipta Paul, Subhankar Mishra

The leakage of data might have been an extreme effect on the personal level if it contains sensitive information. Common prevention methods like encryption-decryption, endpoint protection, intrusion detection system are prone to leakage. Differential privacy comes to the rescue with a proper promise of protection against leakage, as it uses a randomized response technique at the time of collection of the data which promises strong privacy with better utility. Differential privacy allows one to access the forest of data by describing their pattern of groups without disclosing any individual trees. The current adaption of differential privacy by leading tech companies and academia encourages authors to explore the topic in detail. The different aspects of differential privacy, it's application in privacy protection and leakage of information, a comparative discussion, on the current research approaches in this field, its utility in the real world as well as the trade-offs - will be discussed.

LGJun 7, 2020
BUDS: Balancing Utility and Differential Privacy by Shuffling

Poushali Sengupta, Sudipta Paul, Subhankar Mishra

Balancing utility and differential privacy by shuffling or \textit{BUDS} is an approach towards crowd-sourced, statistical databases, with strong privacy and utility balance using differential privacy theory. Here, a novel algorithm is proposed using one-hot encoding and iterative shuffling with the loss estimation and risk minimization techniques, to balance both the utility and privacy. In this work, after collecting one-hot encoded data from different sources and clients, a step of novel attribute shuffling technique using iterative shuffling (based on the query asked by the analyst) and loss estimation with an updation function and risk minimization produces a utility and privacy balanced differential private report. During empirical test of balanced utility and privacy, BUDS produces $ε= 0.02$ which is a very promising result. Our algorithm maintains a privacy bound of $ε= ln [t/((n_1 - 1)^S)]$ and loss bound of $c' \bigg|e^{ln[t/((n_1 - 1)^S)]} - 1\bigg|$.

CRJan 6, 2020
ARA : Aggregated RAPPOR and Analysis for Centralized Differential Privacy

Sudipta Paul, Subhankar Mishra

Differential privacy(DP) has now become a standard in case of sensitive statistical data analysis. The two main approaches in DP is local and central. Both the approaches have a clear gap in terms of data storing,amount of data to be analyzed, analysis, speed etc. Local wins on the speed. We have tested the state of the art standard RAPPOR which is a local approach and supported this gap. Our work completely focuses on that part too. Here, we propose a model which initially collects RAPPOR reports from multiple clients which are then pushed to a Tf-Idf estimation model. The Tf-Idf estimation model then estimates the reports on the basis of the occurrence of "on bit" in a particular position and its contribution to that position. Thus it generates a centralized differential privacy analysis from multiple clients. Our model successfully and efficiently analyzed the major truth value every time.