Swagat Kumar

RO
h-index28
20papers
308citations
Novelty40%
AI Score42

20 Papers

CVMar 18Code
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan et al.

Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.

CVJan 31, 2023
CRC-RL: A Novel Visual Feature Representation Architecture for Unsupervised Reinforcement Learning

Darshita Jain, Anima Majumder, Samrat Dutta et al.

This paper addresses the problem of visual feature representation learning with an aim to improve the performance of end-to-end reinforcement learning (RL) models. Specifically, a novel architecture is proposed that uses a heterogeneous loss function, called CRC loss, to learn improved visual features which can then be used for policy learning in RL. The CRC-loss function is a combination of three individual loss functions, namely, contrastive, reconstruction and consistency loss. The feature representation is learned in parallel to the policy learning while sharing the weight updates through a Siamese Twin encoder model. This encoder model is augmented with a decoder network and a feature projection network to facilitate computation of the above loss components. Through empirical analysis involving latent feature visualization, an attempt is made to provide an insight into the role played by this loss function in learning new action-dependent features and how they are linked to the complexity of the problems being solved. The proposed architecture, called CRC-RL, is shown to outperform the existing state-of-the-art methods on the challenging Deep mind control suite environments by a significant margin thereby creating a new benchmark in this field.

QUANT-PHJun 4, 2025
RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations

Swagat Kumar, Jan-Nico Zaech, Colin Michael Wilmott et al.

Variational Quantum Algorithms (VQAs) are a promising approach to leverage Noisy Intermediate-Scale Quantum (NISQ) computers. However, choosing optimal quantum circuits that efficiently solve a given VQA problem is a non-trivial task. Quantum Architecture Search (QAS) algorithms enable automatic generation of quantum circuits tailored to the provided problem. Existing QAS approaches typically adapt classical neural architecture search techniques, training machine learning models to sample relevant circuits, but often overlook the inherent quantum nature of the circuits they produce. By reformulating QAS from a quantum perspective, we propose a sampling-free differentiable QAS algorithm that models the search process as the evolution of a quantum mixed state, which emerges from the search space of quantum circuits. The mixed state formulation also enables our method to incorporate generic noise models, for example the depolarizing channel, which cannot be modeled by state vector simulation. We validate our method by finding circuits for state initialization and Hamiltonian optimization tasks, namely the variational quantum eigensolver and the unweighted max-cut problems. We show our approach to be comparable to, if not outperform, existing QAS techniques while requiring significantly fewer quantum simulations during training, and also show improved robustness levels to noise.

ROJan 11, 2022
Benchmarking Deep Reinforcement Learning Algorithms for Vision-based Robotics

Swagat Kumar, Hayden Sampson, Ardhendu Behera

This paper presents a benchmarking study of some of the state-of-the-art reinforcement learning algorithms used for solving two simulated vision-based robotics problems. The algorithms considered in this study include soft actor-critic (SAC), proximal policy optimization (PPO), interpolated policy gradients (IPG), and their variants with Hindsight Experience replay (HER). The performances of these algorithms are compared against PyBullet's two simulation environments known as KukaDiverseObjectEnv and RacecarZEDGymEnv respectively. The state observations in these environments are available in the form of RGB images and the action space is continuous, making them difficult to solve. A number of strategies are suggested to provide intermediate hindsight goals required for implementing HER algorithm on these problems which are essentially single-goal environments. In addition, a number of feature extraction architectures are proposed to incorporate spatial and temporal attention in the learning process. Through rigorous simulation experiments, the improvement achieved with these components are established. To the best of our knowledge, such a benchmarking study is not available for the above two vision-based robotics problems making it a novel contribution in the field.

LGMay 17, 2021
Controlling an Inverted Pendulum with Policy Gradient Methods-A Tutorial

Swagat Kumar

This paper provides the details of implementing two important policy gradient methods to solve the inverted pendulum problem. These are namely the Deep Deterministic Policy Gradient (DDPG) and the Proximal Policy Optimization (PPO) algorithm. The problem is solved by using an actor-critic model where an actor-network is used to learn the policy function and a critic network is to evaluate the actor-network by learning to estimate the Q function. Apart from briefly explaining the mathematics behind these two algorithms, the details of python implementation are provided which helps in demystifying the underlying complexity of the algorithm. In the process, the readers will be introduced to OpenAI/Gym, Tensorflow 2.x and Keras utilities used for implementing the above concepts.

CVJan 17, 2021
Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture Recognition

Ardhendu Behera, Zachary Wharton, Morteza Ghahremani et al.

Affect is often expressed via non-verbal body language such as actions/gestures, which are vital indicators for human behaviors. Recent studies on recognition of fine-grained actions/gestures in monocular images have mainly focused on modeling spatial configuration of body parts representing body pose, human-objects interactions and variations in local appearance. The results show that this is a brittle approach since it relies on accurate body parts/objects detection. In this work, we argue that there exist local discriminative semantic regions, whose "informativeness" can be evaluated by the attention mechanism for inferring fine-grained gestures/actions. To this end, we propose a novel end-to-end \textbf{Regional Attention Network (RAN)}, which is a fully Convolutional Neural Network (CNN) to combine multiple contextual regions through attention mechanism, focusing on parts of the images that are most relevant to a given task. Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing HOG (Histogram of Oriented Gradient) descriptor. The model is extensively evaluated on ten datasets belonging to 3 different scenarios: 1) head pose recognition, 2) drivers state recognition, and 3) human action and facial expression recognition. The proposed approach outperforms the state-of-the-art by a considerable margin in different metrics.

RONov 21, 2020
Chitrakar: Robotic System for Drawing Jordan Curve of Facial Portrait

Aniruddha Singhal, Ayush Kumar, Shivam Thukral et al.

This paper presents a robotic system (\textit{Chitrakar}) which autonomously converts any image of a human face to a recognizable non-self-intersecting loop (Jordan Curve) and draws it on any planar surface. The image is processed using Mask R-CNN for instance segmentation, Laplacian of Gaussian (LoG) for feature enhancement and intensity-based probabilistic stippling for the image to points conversion. These points are treated as a destination for a travelling salesman and are connected with an optimal path which is calculated heuristically by minimizing the total distance to be travelled. This path is converted to a Jordan Curve in feasible time by removing intersections using a combination of image processing, 2-opt, and Bresenham's Algorithm. The robotic system generates $n$ instances of each image for human aesthetic judgement, out of which the most appealing instance is selected for the final drawing. The drawing is executed carefully by the robot's arm using trapezoidal velocity profiles for jerk-free and fast motion. The drawing, with a decent resolution, can be completed in less than 30 minutes which is impossible to do by hand. This work demonstrates the use of robotics to augment humans in executing difficult craft-work instead of replacing them altogether.

LGOct 11, 2020
A Method for Handling Multi-class Imbalanced Data by Geometry based Information Sampling and Class Prioritized Synthetic Data Generation (GICaPS)

Anima Majumder, Samrat Dutta, Swagat Kumar et al.

This paper looks into the problem of handling imbalanced data in a multi-label classification problem. The problem is solved by proposing two novel methods that primarily exploit the geometric relationship between the feature vectors. The first one is an undersampling algorithm that uses angle between feature vectors to select more informative samples while rejecting the less informative ones. A suitable criterion is proposed to define the informativeness of a given sample. The second one is an oversampling algorithm that uses a generative algorithm to create new synthetic data that respects all class boundaries. This is achieved by finding \emph{no man's land} based on Euclidean distance between the feature vectors. The efficacy of the proposed methods is analyzed by solving a generic multi-class recognition problem based on mixture of Gaussians. The superiority of the proposed algorithms is established through comparison with other state-of-the-art methods, including SMOTE and ADASYN, over ten different publicly available datasets exhibiting high-to-extreme data imbalance. These two methods are combined into a single data processing framework and is labeled as ``GICaPS'' to highlight the role of geometry-based information (GI) sampling and Class-Prioritized Synthesis (CaPS) in dealing with multi-class data imbalance problem, thereby making a novel contribution in this field.

ROOct 3, 2020
Unsupervised Monocular Depth Estimation for Night-time Images using Adversarial Domain Feature Adaptation

Madhu Vankadari, Sourav Garg, Anima Majumder et al.

In this paper, we look into the problem of estimating per-pixel depth maps from unconstrained RGB monocular night-time images which is a difficult task that has not been addressed adequately in the literature. The state-of-the-art day-time depth estimation methods fail miserably when tested with night-time images due to a large domain shift between them. The usual photo metric losses used for training these networks may not work for night-time images due to the absence of uniform lighting which is commonly present in day-time images, making it a difficult problem to solve. We propose to solve this problem by posing it as a domain adaptation problem where a network trained with day-time images is adapted to work for night-time images. Specifically, an encoder is trained to generate features from night-time images that are indistinguishable from those obtained from day-time images by using a PatchGAN-based adversarial discriminative learning method. Unlike the existing methods that directly adapt depth prediction (network output), we propose to adapt feature maps obtained from the encoder network so that a pre-trained day-time depth decoder can be directly used for predicting depth from these adapted features. Hence, the resulting method is termed as "Adversarial Domain Feature Adaptation (ADFA)" and its efficacy is demonstrated through experimentation on the challenging Oxford night driving dataset. Also, The modular encoder-decoder architecture for the proposed ADFA method allows us to use the encoder module as a feature extractor which can be used in many other applications. One such application is demonstrated where the features obtained from our adapted encoder network are shown to outperform other state-of-the-art methods in a visual place recognition problem, thereby, further establishing the usefulness and effectiveness of the proposed approach.

AIJul 1, 2020
A Generalized Reinforcement Learning Algorithm for Online 3D Bin-Packing

Richa Verma, Aniruddha Singhal, Harshad Khadilkar et al.

We propose a Deep Reinforcement Learning (Deep RL) algorithm for solving the online 3D bin packing problem for an arbitrary number of bins and any bin size. The focus is on producing decisions that can be physically implemented by a robotic loading arm, a laboratory prototype used for testing the concept. The problem considered in this paper is novel in two ways. First, unlike the traditional 3D bin packing problem, we assume that the entire set of objects to be packed is not known a priori. Instead, a fixed number of upcoming objects is visible to the loading system, and they must be loaded in the order of arrival. Second, the goal is not to move objects from one point to another via a feasible path, but to find a location and orientation for each object that maximises the overall packing efficiency of the bin(s). Finally, the learnt model is designed to work with problem instances of arbitrary size without retraining. Simulation results show that the RL-based method outperforms state-of-the-art online bin packing heuristics in terms of empirical competitive ratio and volume efficiency.

ROJun 8, 2020
Balancing a CartPole System with Reinforcement Learning -- A Tutorial

Swagat Kumar

In this paper, we provide the details of implementing various reinforcement learning (RL) algorithms for controlling a Cart-Pole system. In particular, we describe various RL concepts such as Q-learning, Deep Q Networks (DQN), Double DQN, Dueling networks, (prioritized) experience replay and show their effect on the learning performance. In the process, the readers will be introduced to OpenAI/Gym and Keras utilities used for implementing the above concepts. It is observed that DQN with PER provides best performance among all other architectures being able to solve the problem within 150 episodes.

ROFeb 20, 2019
Look No Deeper: Recognizing Places from Opposing Viewpoints under Varying Scene Appearance using Single-View Depth Estimation

Sourav Garg, Madhu Babu, Thanuja Dharmasiri et al.

Visual place recognition (VPR) - the act of recognizing a familiar visual place - becomes difficult when there is extreme environmental appearance change or viewpoint change. Particularly challenging is the scenario where both phenomena occur simultaneously, such as when returning for the first time along a road at night that was previously traversed during the day in the opposite direction. While such problems can be solved with panoramic sensors, humans solve this problem regularly with limited field of view vision and without needing to constantly turn around. In this paper, we present a new depth- and temporal-aware visual place recognition system that solves the opposing viewpoint, extreme appearance-change visual place recognition problem. Our system performs sequence-to-single matching by extracting depth-filtered keypoints using a state-of-the-art depth estimation pipeline, constructing a keypoint sequence over multiple frames from the reference dataset, and comparing those keypoints to those in a single query image. We evaluate the system on a challenging benchmark dataset and show that it consistently outperforms state-of-the-art techniques. We also develop a range of diagnostic simulation experiments that characterize the contribution of depth-filtered keypoint sequences with respect to key domain parameters including degree of appearance change and camera motion.

CVNov 27, 2018
UnDEMoN 2.0: Improved Depth and Ego Motion Estimation through Deep Image Sampling

Madhu Babu, Swagat Kumar, Anima Majumder et al.

In this paper, we provide an improved version of UnDEMoN model for depth and ego motion estimation from monocular images. The improvement is achieved by combining the standard bi-linear sampler with a deep network based image sampling model (DIS-NET) to provide better image reconstruction capabilities on which the depth estimation accuracy depends in un-supervised learning models. While DIS-NET provides higher order regression and larger input search space, the bi-linear sampler provides geometric constraints necessary for reducing the size of the solution space for an ill-posed problem of this kind. This combination is shown to provide significant improvement in depth and pose estimation accuracy outperforming all existing state-of-the-art methods in this category. In addition, the modified network uses far less number of tunable parameters making it one of the lightest deep network model for depth estimation. The proposed model is labeled as "UnDEMoN 2.0" indicating an improvement over the existing UnDEMoN model. The efficacy of the proposed model is demonstrated through rigorous experimental analysis on the standard KITTI dataset.

ROSep 4, 2018
Optimized edge-based grasping method for a cluttered environment

Olyvia Kundu, Swagat Kumar

This paper looks into the problem of grasping region localization along with suitable pose from a cluttered environment without any a priori knowledge of the object geometry. This end-to-end method detects the handles from a single frame of input sensor. The pipeline starts with the creation of multiple surface segments to detect the required gap in the first stage, and eventually helps in detecting boundary lines. Our novelty lies in the fact that we have merged color based edge and depth edge in order to get more reliable boundary points through which a pair of boundary line is fitted. Also this information is used to validate the handle by measuring the angle between the boundary lines and also by checking for amy potential occlusion. In addition, we also proposed an optimizing cost function based method to choose the best handle from a set of valid handles. The method proposed is tested on real-life datasets and is found to out form state of the art methods in terms of precision.

CVAug 27, 2018
A Deeper Insight into the UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation

Madhu Babu, Anima Majumder, Kaushik Das et al.

This paper presents an unsupervised deep learning framework called UnDEMoN for estimating dense depth map and 6-DoF camera pose information directly from monocular images. The proposed network is trained using unlabeled monocular stereo image pairs and is shown to provide superior performance in depth and ego-motion estimation compared to the existing state-of-the-art. These improvements are achieved by introducing a new objective function that aims to minimize spatial as well as temporal reconstruction losses simultaneously. These losses are defined using bi-linear sampling kernel and penalized using the Charbonnier penalty function. The objective function, thus created, provides robustness to image gradient noises thereby improving the overall estimation accuracy without resorting to any coarse to fine strategies which are currently prevalent in the literature. Another novelty lies in the fact that we combine a disparity-based depth estimation network with a pose estimation network to obtain absolute scale-aware 6 DOF Camera pose and superior depth map. The effectiveness of the proposed approach is demonstrated through performance comparison with the existing supervised and unsupervised methods on the KITTI driving dataset.

ROJul 27, 2018
A Novel Geometry-based Algorithm for Robust Grasping in Extreme Clutter Environment

Olyvia Kundu, Swagat Kumar

This paper looks into the problem of grasping unknown objects in a cluttered environment using 3D point cloud data obtained from a range or an RGBD sensor. The objective is to identify graspable regions and detect suitable grasp poses from a single view, possibly, partial 3D point cloud without any apriori knowledge of the object geometry. The problem is solved in two steps: (1) identifying and segmenting various object surfaces and, (2) searching for suitable grasping handles on these surfaces by applying geometric constraints of the physical gripper. The first step is solved by using a modified version of region growing algorithm that uses a pair of thresholds for smoothness constraint on local surface normals to find natural boundaries of object surfaces. In this process, a novel concept of edge point is introduced that allows us to segment between different surfaces of the same object. The second step is solved by converting a 6D pose detection problem into a 1D linear search problem by projecting 3D cloud points onto the principal axes of the object surface. The graspable handles are then localized by applying physical constraints of the gripper. The resulting method allows us to grasp all kinds of objects including rectangular or box-type objects with flat surfaces which have been difficult so far to deal with in the grasping literature. The proposed method is simple and can be implemented in real-time and does not require any off-line training phase for finding these affordances. The improvements achieved is demonstrated through comparison with another state-of-the-art grasping algorithm on various publicly-available and self-created datasets.

ROJun 27, 2017
Managing a Fleet of Autonomous Mobile Robots (AMR) using Cloud Robotics Platform

Aniruddha Singhal, Nishant Kejriwal, Prasun Pallav et al.

In this paper, we provide details of implementing a system for managing a fleet of autonomous mobile robots (AMR) operating in a factory or a warehouse premise. While the robots are themselves autonomous in its motion and obstacle avoidance capability, the target destination for each robot is provided by a global planner. The global planner and the ground vehicles (robots) constitute a multi agent system (MAS) which communicate with each other over a wireless network. Three different approaches are explored for implementation. The first two approaches make use of the distributed computing based Networked Robotics architecture and communication framework of Robot Operating System (ROS) itself while the third approach uses Rapyuta Cloud Robotics framework for this implementation. The comparative performance of these approaches are analyzed through simulation as well as real world experiment with actual robots. These analyses provide an in-depth understanding of the inner working of the Cloud Robotics Platform in contrast to the usual ROS framework. The insight gained through this exercise will be valuable for students as well as practicing engineers interested in implementing similar systems else where. In the process, we also identify few critical limitations of the current Rapyuta platform and provide suggestions to overcome them.

ROJun 22, 2017
Improving Condition- and Environment-Invariant Place Recognition with Semantic Place Categorization

Sourav Garg, Adam Jacobson, Swagat Kumar et al.

The place recognition problem comprises two distinct subproblems; recognizing a specific location in the world ("specific" or "ordinary" place recognition) and recognizing the type of place (place categorization). Both are important competencies for mobile robots and have each received significant attention in the robotics and computer vision community, but usually as separate areas of investigation. In this paper, we leverage the powerful complementary nature of place recognition and place categorization processes to create a new hybrid place recognition system that uses place context to inform place recognition. We show that semantic place categorization creates an informative natural segmenting of physical space that in turn enables significantly better place recognition performance in comparison to existing techniques. In particular, this new semantically informed approach adds robustness to significant local changes within the environment, such as transitioning between indoor and outdoor environments or between dark and light rooms in a house, complementing the capabilities of current condition-invariant techniques that are robust to globally consistent change (such as day to night cycles). We perform experiments using 4 benchmark and new datasets and show that semantically-informed place recognition outperforms the previous state-of-the-art systems. Like it does for object recognition [1], we believe that semantics can play a key role in boosting conventional place recognition and navigation performance for robotic systems.

ROMar 7, 2017
Design and Development of an automated Robotic Pick & Stow System for an e-Commerce Warehouse

Swagat Kumar, Anima Majumder, Samrat Dutta et al.

In this paper, we provide details of a robotic system that can automate the task of picking and stowing objects from and to a rack in an e-commerce fulfillment warehouse. The system primarily comprises of four main modules: (1) Perception module responsible for recognizing query objects and localizing them in the 3-dimensional robot workspace; (2) Planning module generates necessary paths that the robot end- effector has to take for reaching the objects in the rack or in the tote; (3) Calibration module that defines the physical workspace for the robot visible through the on-board vision system; and (4) Gripping and suction system for picking and stowing different kinds of objects. The perception module uses a faster region-based Convolutional Neural Network (R-CNN) to recognize objects. We designed a novel two finger gripper that incorporates pneumatic valve based suction effect to enhance its ability to pick different kinds of objects. The system was developed by IITK-TCS team for participation in the Amazon Picking Challenge 2016 event. The team secured a fifth place in the stowing task in the event. The purpose of this article is to share our experiences with students and practicing engineers and enable them to build similar systems. The overall efficacy of the system is demonstrated through several simulation as well as real-world experiments with actual robots.

CVJan 25, 2015
An Occlusion Reasoning Scheme for Monocular Pedestrian Tracking in Dynamic Scenes

Sourav Garg, Swagat Kumar, Rajesh Ratnakaram et al.

This paper looks into the problem of pedestrian tracking using a monocular, potentially moving, uncalibrated camera. The pedestrians are located in each frame using a standard human detector, which are then tracked in subsequent frames. This is a challenging problem as one has to deal with complex situations like changing background, partial or full occlusion and camera motion. In order to carry out successful tracking, it is necessary to resolve associations between the detected windows in the current frame with those obtained from the previous frame. Compared to methods that use temporal windows incorporating past as well as future information, we attempt to make decision on a frame-by-frame basis. An occlusion reasoning scheme is proposed to resolve the association problem between a pair of consecutive frames by using an affinity matrix that defines the closeness between a pair of windows and then, uses a binary integer programming to obtain unique association between them. A second stage of verification based on SURF matching is used to deal with those cases where the above optimization scheme might yield wrong associations. The efficacy of the approach is demonstrated through experiments on several standard pedestrian datasets.