48.9CVMay 30
T-CLIP: Enabling Thermal Perception for Contrastive Language-Image PretrainingTayeba Qazi, Ayush Maheshwari, Prerana Mukherjee et al.
Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
CVAug 13, 2024
A Comprehensive Survey on Synthetic Infrared Image synthesisAvinash Upadhyay, Manoj sharma, Prerana Mukherjee et al.
Synthetic infrared (IR) scene and target generation is an important computer vision problem as it allows the generation of realistic IR images and targets for training and testing of various applications, such as remote sensing, surveillance, and target recognition. It also helps reduce the cost and risk associated with collecting real-world IR data. This survey paper aims to provide a comprehensive overview of the conventional mathematical modelling-based methods and deep learning-based methods used for generating synthetic IR scenes and targets. The paper discusses the importance of synthetic IR scene and target generation and briefly covers the mathematics of blackbody and grey body radiations, as well as IR image-capturing methods. The potential use cases of synthetic IR scenes and target generation are also described, highlighting the significance of these techniques in various fields. Additionally, the paper explores possible new ways of developing new techniques to enhance the efficiency and effectiveness of synthetic IR scenes and target generation while highlighting the need for further research to advance this field.
CVApr 25, 2022
OCFormer: One-Class Transformer Network for Image ClassificationPrerana Mukherjee, Chandan Kumar Roy, Swalpa Kumar Roy
We propose a novel deep learning framework based on Vision Transformers (ViT) for one-class classification. The core idea is to use zero-centered Gaussian noise as a pseudo-negative class for latent space representation and then train the network using the optimal loss function. In prior works, there have been tremendous efforts to learn a good representation using varieties of loss functions, which ensures both discriminative and compact properties. The proposed one-class Vision Transformer (OCFormer) is exhaustively experimented on CIFAR-10, CIFAR-100, Fashion-MNIST and CelebA eyeglasses datasets. Our method has shown significant improvements over competing CNN based one-class classifier approaches.
CVJan 9, 2022Code
MaskMTL: Attribute prediction in masked facial images with deep multitask learningPrerana Mukherjee, Vinay Kaushik, Ronak Gupta et al.
Predicting attributes in the landmark free facial images is itself a challenging task which gets further complicated when the face gets occluded due to the usage of masks. Smart access control gates which utilize identity verification or the secure login to personal electronic gadgets may utilize face as a biometric trait. Particularly, the Covid-19 pandemic increasingly validates the essentiality of hygienic and contactless identity verification. In such cases, the usage of masks become more inevitable and performing attribute prediction helps in segregating the target vulnerable groups from community spread or ensuring social distancing for them in a collaborative environment. We create a masked face dataset by efficiently overlaying masks of different shape, size and textures to effectively model variability generated by wearing mask. This paper presents a deep Multi-Task Learning (MTL) approach to jointly estimate various heterogeneous attributes from a single masked facial image. Experimental results on benchmark face attribute UTKFace dataset demonstrate that the proposed approach supersedes in performance to other competing techniques. The source code is available at https://github.com/ritikajha/Attribute-prediction-in-masked-facial-images-with-deep-multitask-learning
CLMay 24, 2021Code
Hater-O-Genius Aggression Classification using Capsule NetworksParth Patwa, Srinivas PYKL, Amitava Das et al.
Contending hate speech in social media is one of the most challenging social problems of our time. There are various types of anti-social behavior in social media. Foremost of them is aggressive behavior, which is causing many social issues such as affecting the social lives and mental health of social media users. In this paper, we propose an end-to-end ensemble-based architecture to automatically identify and classify aggressive tweets. Tweets are classified into three categories - Covertly Aggressive, Overtly Aggressive, and Non-Aggressive. The proposed architecture is an ensemble of smaller subnetworks that are able to characterize the feature embeddings effectively. We demonstrate qualitatively that each of the smaller subnetworks is able to learn unique features. Our best model is an ensemble of Capsule Networks and results in a 65.2% F1 score on the Facebook test set, which results in a performance gain of 0.95% over the TRAC-2018 winners. The code and the model weights are publicly available at https://github.com/parthpatwa/Hater-O-Genius-Aggression-Classification-using-Capsule-Networks.
CVJan 27, 2022
ASOC: Adaptive Self-aware Object Co-localizationKoteswar Rao Jerripothula, Prerana Mukherjee
The primary goal of this paper is to localize objects in a group of semantically similar images jointly, also known as the object co-localization problem. Most related existing works are essentially weakly-supervised, relying prominently on the neighboring images' weak-supervision. Although weak supervision is beneficial, it is not entirely reliable, for the results are quite sensitive to the neighboring images considered. In this paper, we combine it with a self-awareness phenomenon to mitigate this issue. By self-awareness here, we refer to the solution derived from the image itself in the form of saliency cue, which can also be unreliable if applied alone. Nevertheless, combining these two paradigms together can lead to a better co-localization ability. Specifically, we introduce a dynamic mediator that adaptively strikes a proper balance between the two static solutions to provide an optimal solution. Therefore, we call this method \textit{ASOC}: Adaptive Self-aware Object Co-localization. We perform exhaustive experiments on several benchmark datasets and validate that weak-supervision supplemented with self-awareness has superior performance outperforming several compared competing methods.
CVDec 9, 2020
Generating Out of Distribution Adversarial Attack using Latent Space PoisoningUjjwal Upadhyay, Prerana Mukherjee
Traditional adversarial attacks rely upon the perturbations generated by gradients from the network which are generally safeguarded by gradient guided search to provide an adversarial counterpart to the network. In this paper, we propose a novel mechanism of generating adversarial examples where the actual image is not corrupted rather its latent space representation is utilized to tamper with the inherent structure of the image while maintaining the perceptual quality intact and to act as legitimate data samples. As opposed to gradient-based attacks, the latent space poisoning exploits the inclination of classifiers to model the independent and identical distribution of the training dataset and tricks it by producing out of distribution samples. We train a disentangled variational autoencoder (beta-VAE) to model the data in latent space and then we add noise perturbations using a class-conditioned distribution function to the latent space under the constraint that it is misclassified to the target label. Our empirical results on MNIST, SVHN, and CelebA dataset validate that the generated adversarial examples can easily fool robust l_0, l_2, l_inf norm classifiers designed using provably robust defense mechanisms.
MMAug 28, 2020
Semantics Preserving Hierarchy based Retrieval of Indian heritage monumentsRonak Gupta, Prerana Mukherjee, Brejesh Lall et al.
Monument classification can be performed on the basis of their appearance and shape from coarse to fine categories. Although there is much semantic information present in the monuments which is reflected in the eras they were built, its type or purpose, the dynasty which established it, etc. Particularly, Indian subcontinent exhibits a huge deal of variation in terms of architectural styles owing to its rich cultural heritage. In this paper, we propose a framework that utilizes hierarchy to preserve semantic information while performing image classification or image retrieval. We encode the learnt deep embeddings to construct a dictionary of images and then utilize a re-ranking framework on the the retrieved results using DeLF features. The semantic information preserved in these embeddings helps to classify unknown monuments at higher level of granularity in hierarchy. We have curated a large, novel Indian heritage monuments dataset comprising of images of historical, cultural and religious importance with subtypes of eras, dynasties and architectural styles. We demonstrate the performance of the proposed framework in image classification and retrieval tasks and compare it with other competing methods on this dataset.
ASFeb 6, 2020
Attentional networks for music generationGullapalli Keerti, A N Vaishnavi, Prerana Mukherjee et al.
Realistic music generation has always remained as a challenging problem as it may lack structure or rationality. In this work, we propose a deep learning based music generation method in order to produce old style music particularly JAZZ with rehashed melodic structures utilizing a Bi-directional Long Short Term Memory (Bi-LSTM) Neural Network with Attention. Owing to the success in modelling long-term temporal dependencies in sequential data and its success in case of videos, Bi-LSTMs with attention serve as the natural choice and early utilization in music generation. We validate in our experiments that Bi-LSTMs with attention are able to preserve the richness and technical nuances of the music performed.
GRFeb 6, 2020
AnimePose: Multi-person 3D pose estimation and animationLaxman Kumarapu, Prerana Mukherjee
3D animation of humans in action is quite challenging as it involves using a huge setup with several motion trackers all over the person's body to track the movements of every limb. This is time-consuming and may cause the person discomfort in wearing exoskeleton body suits with motion sensors. In this work, we present a trivial yet effective solution to generate 3D animation of multiple persons from a 2D video using deep learning. Although significant improvement has been achieved recently in 3D human pose estimation, most of the prior works work well in case of single person pose estimation and multi-person pose estimation is still a challenging problem. In this work, we firstly propose a supervised multi-person 3D pose estimation and animation framework namely AnimePose for a given input RGB video sequence. The pipeline of the proposed system consists of various modules: i) Person detection and segmentation, ii) Depth Map estimation, iii) Lifting 2D to 3D information for person localization iv) Person trajectory prediction and human pose tracking. Our proposed system produces comparable results on previous state-of-the-art 3D multi-person pose estimation methods on publicly available datasets MuCo-3DHP and MuPoTS-3D datasets and it also outperforms previous state-of-the-art human pose tracking methods by a significant margin of 11.7% performance gain on MOTA score on Posetrack 2018 dataset.
ASOct 18, 2019
Indian EmoSpeech Command Dataset: A dataset for emotion based speech recognition in the wildSubham Banga, Ujjwal Upadhyay, Piyush Agarwal et al.
Speech emotion analysis is an important task which further enables several application use cases. The non-verbal sounds within speech utterances also play a pivotal role in emotion analysis in speech. Due to the widespread use of smartphones, it becomes viable to analyze speech commands captured using microphones for emotion understanding by utilizing on-device machine learning models. The non-verbal information includes the environment background sounds describing the type of surroundings, current situation and activities being performed. In this work, we consider both verbal (speech commands) and non-verbal sounds (background noises) within an utterance for emotion analysis in real-life scenarios. We create an indigenous dataset for this task namely "Indian EmoSpeech Command Dataset". It contains keywords with diverse emotions and background sounds, presented to explore new challenges in audio analysis. We exhaustively compare with various baseline models for emotion analysis on speech commands on several performance metrics. We demonstrate that we achieve a significant average gain of 3.3% in top-one score over a subset of speech command dataset for keyword spotting.
CVSep 4, 2019
Aerial multi-object tracking by detection using deep association networksAjit Jadhav, Prerana Mukherjee, Vinay Kaushik et al.
A lot a research is focused on object detection and it has achieved significant advances with deep learning techniques in recent years. Inspite of the existing research, these algorithms are not usually optimal for dealing with sequences or images captured by drone-based platforms, due to various challenges such as view point change, scales, density of object distribution and occlusion. In this paper, we develop a model for detection of objects in drone images using the VisDrone2019 DET dataset. Using the RetinaNet model as our base, we modify the anchor scales to better handle the detection of dense distribution and small size of the objects. We explicitly model the channel interdependencies by using "Squeeze-and-Excitation" (SE) blocks that adaptively recalibrates channel-wise feature responses. This helps to bring significant improvements in performance at a slight additional computational cost. Using this architecture for object detection, we build a custom DeepSORT network for object detection on the VisDrone2019 MOT dataset by training a custom Deep Association network for the algorithm.
CVSep 3, 2019
Multi-level Attention network using text, audio and video for Depression PredictionAnupama Ray, Siddharth Kumar, Rutvik Reddy et al.
Depression has been the leading cause of mental-health illness worldwide. Major depressive disorder (MDD), is a common mental health disorder that affects both psychologically as well as physically which could lead to loss of lives. Due to the lack of diagnostic tests and subjectivity involved in detecting depression, there is a growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge amount of variations possible in behaviour makes the problem more challenging. This paper presents a novel multi-level attention based network for multi-modal depression prediction that fuses features from audio, video and text modalities while learning the intra and inter modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for the decision making. We perform exhaustive experimentation to create different regression models for audio, video and text modalities. Several fusions models with different configurations are constructed to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.
CVJul 9, 2019
A Light weight and Hybrid Deep Learning Model based Online Signature VerificationChandra Sekhar V., Anoushka Doctor, Prerana Mukherjee et al.
The augmented usage of deep learning-based models for various AI related problems are as a result of modern architectures of deeper length and the availability of voluminous interpreted datasets. The models based on these architectures require huge training and storage cost, which makes them inefficient to use in critical applications like online signature verification (OSV) and to deploy in resource constraint devices. As a solution, in this work, our contribution is two-fold. 1) An efficient dimensionality reduction technique, to reduce the number of features to be considered and 2) a state-of-the-art model CNN-LSTM based hybrid architecture for online signature verification. Thorough experiments on the publicly available datasets MCYT, SUSIG, SVC confirms that the proposed model achieves better accuracy even with as low as one training sample. The proposed models yield state-of-the-art performance in various categories of all the three datasets.
CVMay 21, 2019
Online Signature Verification Based on Writer Specific Feature Selection and Fuzzy Similarity MeasureChandra Sekhar, Prerana Mukherjee, D. S. Guru et al.
Online Signature Verification (OSV) is a widely used biometric attribute for user behavioral characteristic verification in digital forensics. In this manuscript, owing to large intra-individual variability, a novel method for OSV based on an interval symbolic representation and a fuzzy similarity measure grounded on writer specific parameter selection is proposed. The two parameters, namely, writer specific acceptance threshold and optimal feature set to be used for authenticating the writer are selected based on minimum equal error rate (EER) attained during parameter fixation phase using the training signature samples. This is in variation to current techniques for OSV, which are primarily writer independent, in which a common set of features and acceptance threshold are chosen. To prove the robustness of our system, we have exhaustively assessed our system with four standard datasets i.e. MCYT-100 (DB1), MCYT-330 (DB2), SUSIG-Visual corpus and SVC-2004- Task2. Experimental outcome confirms the effectiveness of fuzzy similarity metric-based writer dependent parameter selection for OSV by achieving a lower error rate as compared to many recent and state-of-the art OSV models.
CVApr 8, 2019
VayuAnukulani: Adaptive Memory Networks for Air Pollution ForecastingDivyam Madaan, Radhika Dua, Prerana Mukherjee et al.
Air pollution is the leading environmental health hazard globally due to various sources which include factory emissions, car exhaust and cooking stoves. As a precautionary measure, air pollution forecast serves as the basis for taking effective pollution control measures, and accurate air pollution forecasting has become an important task. In this paper, we forecast fine-grained ambient air quality information for 5 prominent locations in Delhi based on the historical and real-time ambient air quality and meteorological data reported by Central Pollution Control board. We present VayuAnukulani system, a novel end-to-end solution to predict air quality for next 24 hours by estimating the concentration and level of different air pollutants including nitrogen dioxide ($NO_2$), particulate matter ($PM_{2.5}$ and $PM_{10}$) for Delhi. Extensive experiments on data sources obtained in Delhi demonstrate that the proposed adaptive attention based Bidirectional LSTM Network outperforms several baselines for classification and regression models. The accuracy of the proposed adaptive system is $\sim 15 - 20\%$ better than the same offline trained model. We compare the proposed methodology on several competing baselines, and show that the network outperforms conventional methods by $\sim 3 - 5 \%$.
CVApr 2, 2019
DSAL-GAN: Denoising based Saliency Prediction with Generative Adversarial NetworksPrerana Mukherjee, Manoj Sharma, Megh Makwana et al.
Synthesizing high quality saliency maps from noisy images is a challenging problem in computer vision and has many practical applications. Samples generated by existing techniques for saliency detection cannot handle the noise perturbations smoothly and fail to delineate the salient objects present in the given scene. In this paper, we present a novel end-to-end coupled Denoising based Saliency Prediction with Generative Adversarial Network (DSAL-GAN) framework to address the problem of salient object detection in noisy images. DSAL-GAN consists of two generative adversarial-networks (GAN) trained end-to-end to perform denoising and saliency prediction altogether in a holistic manner. The first GAN consists of a generator which denoises the noisy input image, and in the discriminator counterpart we check whether the output is a denoised image or ground truth original image. The second GAN predicts the saliency maps from raw pixels of the input denoised image using a data-driven metric based on saliency prediction method with adversarial loss. Cycle consistency loss is also incorporated to further improve salient region prediction. We demonstrate with comprehensive evaluation that the proposed framework outperforms several baseline saliency models on various performance benchmarks.
CVMar 30, 2019
OSVNet: Convolutional Siamese Network for Writer Independent Online Signature VerificationChandra Sekhar, Prerana Mukherjee, Devanur S Guru et al.
Online signature verification (OSV) is one of the most challenging tasks in writer identification and digital forensics. Owing to the large intra-individual variability, there is a critical requirement to accurately learn the intra-personal variations of the signature to achieve higher classification accuracy. To achieve this, in this paper, we propose an OSV framework based on deep convolutional Siamese network (DCSN). DCSN automatically extracts robust feature descriptions based on metric-based loss function which decreases intra-writer variability (Genuine-Genuine) and increases inter-individual variability (Genuine-Forgery) and directs the DCSN for effective discriminative representation learning for online signatures and extend it for one shot learning framework. Comprehensive experimentation conducted on three widely accepted benchmark datasets MCYT-100 (DB1), MCYT-330 (DB2) and SVC-2004-Task2 demonstrate the capability of our framework to distinguish the genuine and forgery samples. Experimental results confirm the efficiency of deep convolutional Siamese network based OSV by achieving a lower error rate as compared to many recent and state-of-the art OSV techniques.
CVDec 13, 2018
Nrityantar: Pose oblivious Indian classical dance sequence classification systemVinay Kaushik, Prerana Mukherjee, Brejesh Lall
In this paper, we attempt to advance the research work done in human action recognition to a rather specialized application namely Indian Classical Dance (ICD) classification. The variation in such dance forms in terms of hand and body postures, facial expressions or emotions and head orientation makes pose estimation an extremely challenging task. To circumvent this problem, we construct a pose-oblivious shape signature which is fed to a sequence learning framework. The pose signature representation is done in two-fold process. First, we represent person-pose in first frame of a dance video using symmetric Spatial Transformer Networks (STN) to extract good person object proposals and CNN-based parallel single person pose estimator (SPPE). Next, the pose basis are converted to pose flows by assigning a similarity score between successive poses followed by non-maximal suppression. Instead of feeding a simple chain of joints in the sequence learner which generally hinders the network performance we constitute a feature vector of the normalized distance vectors, flow, angles between anchor joints which captures the adjacency configuration in the skeletal pattern. Thus, the kinematic relationship amongst the body joints across the frames using pose estimation helps in better establishing the spatio-temporal dependencies. We present an exhaustive empirical evaluation of state-of-the-art deep network based methods for dance classification on ICD dataset.
CVMar 7, 2018
Object cosegmentation using deep Siamese networkPrerana Mukherjee, Brejesh Lall, Snehith Lattupally
Object cosegmentation addresses the problem of discovering similar objects from multiple images and segmenting them as foreground simultaneously. In this paper, we propose a novel end-to-end pipeline to segment the similar objects simultaneously from relevant set of images using supervised learning via deep-learning framework. We experiment with multiple set of object proposal generation techniques and perform extensive numerical evaluations by training the Siamese network with generated object proposals. Similar objects proposals for the test images are retrieved using the ANNOY (Approximate Nearest Neighbor) library and deep semantic segmentation is performed on them. Finally, we form a collage from the segmented similar objects based on the relative importance of the objects.
CVDec 4, 2017
Enhanced Characterness for Text Detection in the WildAarushi Agrawal, Prerana Mukherjee, Siddharth Srivastava et al.
Text spotting is an interesting research problem as text may appear at any random place and may occur in various forms. Moreover, ability to detect text opens the horizons for improving many advanced computer vision problems. In this paper, we propose a novel language agnostic text detection method utilizing edge enhanced Maximally Stable Extremal Regions in natural scenes by defining strong characterness measures. We show that a simple combination of characterness cues help in rejecting the non text regions. These regions are further fine-tuned for rejecting the non-textual neighbor regions. Comprehensive evaluation of the proposed scheme shows that it provides comparative to better generalization performance to the traditional methods for this task.
CVDec 4, 2017
Object Classification using Ensemble of Local and Deep FeaturesSiddharth Srivastava, Prerana Mukherjee, Brejesh Lall et al.
In this paper we propose an ensemble of local and deep features for object classification. We also compare and contrast effectiveness of feature representation capability of various layers of convolutional neural network. We demonstrate with extensive experiments for object classification that the representation capability of features from deep networks can be complemented with information captured from local features. We also find out that features from various deep convolutional networks encode distinctive characteristic information. We establish that, as opposed to conventional practice, intermediate layers of deep networks can augment the classification capabilities of features obtained from fully connected layers.
CVJun 14, 2017
SalProp: Salient object proposals via aggregated edge cuesPrerana Mukherjee, Brejesh Lall, Sarvaswa Tandon
In this paper, we propose a novel object proposal generation scheme by formulating a graph-based salient edge classification framework that utilizes the edge context. In the proposed method, we construct a Bayesian probabilistic edge map to assign a saliency value to the edgelets by exploiting low level edge features. A Conditional Random Field is then learned to effectively combine these features for edge classification with object/non-object label. We propose an objectness score for the generated windows by analyzing the salient edge density inside the bounding box. Extensive experiments on PASCAL VOC 2007 dataset demonstrate that the proposed method gives competitive performance against 10 popular generic object detection techniques while using fewer number of proposals.
CVMay 20, 2015
Benchmarking KAZE and MCM for Multiclass ClassificationSiddharth Srivastava, Prerana Mukherjee, Brejesh Lall
In this paper, we propose a novel approach for feature generation by appropriately fusing KAZE and SIFT features. We then use this feature set along with Minimal Complexity Machine(MCM) for object classification. We show that KAZE and SIFT features are complementary. Experimental results indicate that an elementary integration of these techniques can outperform the state-of-the-art approaches.