Snehasis Mukherjee

CV
h-index9
25papers
1,148citations
Novelty39%
AI Score53

25 Papers

CVMay 21Code
Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning

Vipul Arya, S. H. Shabbeer Basha, Srikrishna U N et al.

Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry as they require more training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is an expensive and time-consuming task. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim at choosing the samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance as we pool based on one dimension, i.e., either diverse or uncertain. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better compared to state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The codes related to this study will be available at https://github.com/XXX/LCD.

CVFeb 14, 2023
WSD: Wild Selfie Dataset for Face Recognition in Selfie Images

Laxman Kumarapu, Shiv Ram Dubey, Snehasis Mukherjee et al.

With the rise of handy smart phones in the recent years, the trend of capturing selfie images is observed. Hence efficient approaches are required to be developed for recognising faces in selfie images. Due to the short distance between the camera and face in selfie images, and the different visual effects offered by the selfie apps, face recognition becomes more challenging with existing approaches. A dataset is needed to be developed to encourage the study to recognize faces in selfie images. In order to alleviate this problem and to facilitate the research on selfie face images, we develop a challenging Wild Selfie Dataset (WSD) where the images are captured from the selfie cameras of different smart phones, unlike existing datasets where most of the images are captured in controlled environment. The WSD dataset contains 45,424 images from 42 individuals (i.e., 24 female and 18 male subjects), which are divided into 40,862 training and 4,562 test images. The average number of images per subject is 1,082 with minimum and maximum number of images for any subject are 518 and 2,634, respectively. The proposed dataset consists of several challenges, including but not limited to augmented reality filtering, mirrored images, occlusion, illumination, scale, expressions, view-point, aspect ratio, blur, partial faces, rotation, and alignment. We compare the proposed dataset with existing benchmark datasets in terms of different characteristics. The complexity of WSD dataset is also observed experimentally, where the performance of the existing state-of-the-art face recognition methods is poor on WSD dataset, compared to the existing datasets. Hence, the proposed WSD dataset opens up new challenges in the area of face recognition and can be beneficial to the community to study the specific challenges related to selfie images and develop improved methods for face recognition in selfie images.

CVMay 21
Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks

Amrita Singh, Snehasis Mukherjee

Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks

LGOct 16, 2021Code
DFW-PP: Dynamic Feature Weighting based Popularity Prediction for Social Media Content

Viswanatha Reddy G, Chaitanya B S N, Prathyush P et al.

The increasing popularity of social media platforms makes it important to study user engagement, which is a crucial aspect of any marketing strategy or business model. The over-saturation of content on social media platforms has persuaded us to identify the important factors that affect content popularity. This comes from the fact that only an iota of the humongous content available online receives the attention of the target audience. Comprehensive research has been done in the area of popularity prediction using several Machine Learning techniques. However, we observe that there is still significant scope for improvement in analyzing the social importance of media content. We propose the DFW-PP framework, to learn the importance of different features that vary over time. Further, the proposed method controls the skewness of the distribution of the features by applying a log-log normalization. The proposed method is experimented with a benchmark dataset, to show promising results. The code will be made publicly available at https://github.com/chaitnayabasava/DFW-PP.

CVApr 25, 2020Code
AutoTune: Automatically Tuning Convolutional Neural Networks for Improved Transfer Learning

S. H. Shabbeer Basha, Sravan Kumar Vinakota, Viswanath Pulabaigari et al.

Transfer learning enables solving a specific task having limited data by using the pre-trained deep networks trained on large-scale datasets. Typically, while transferring the learned knowledge from source task to the target task, the last few layers are fine-tuned (re-trained) over the target dataset. However, these layers are originally designed for the source task that might not be suitable for the target task. In this paper, we introduce a mechanism for automatically tuning the Convolutional Neural Networks (CNN) for improved transfer learning. The pre-trained CNN layers are tuned with the knowledge from target data using Bayesian Optimization. First, we train the final layer of the base CNN model by replacing the number of neurons in the softmax layer with the number of classes involved in the target task. Next, the pre-trained CNN is tuned automatically by observing the classification performance on the validation data (greedy criteria). To evaluate the performance of the proposed method, experiments are conducted on three benchmark datasets, e.g., CalTech-101, CalTech-256, and Stanford Dogs. The classification results obtained through the proposed AutoTune method outperforms the standard baseline transfer learning methods over the three datasets by achieving $95.92\%$, $86.54\%$, and $84.67\%$ accuracy over CalTech-101, CalTech-256, and Stanford Dogs, respectively. The experimental results obtained in this study depict that tuning of the pre-trained CNN layers with the knowledge from the target dataset confesses better transfer learning ability. The source codes are available at https://github.com/JekyllAndHyde8999/AutoTune_CNN_TransferLearning.

CVJan 22, 2020Code
AutoFCL: Automatically Tuning Fully Connected Layers for Handling Small Dataset

S. H. Shabbeer Basha, Sravan Kumar Vinakota, Shiv Ram Dubey et al.

Deep Convolutional Neural Networks (CNN) have evolved as popular machine learning models for image classification during the past few years, due to their ability to learn the problem-specific features directly from the input images. The success of deep learning models solicits architecture engineering rather than hand-engineering the features. However, designing state-of-the-art CNN for a given task remains a non-trivial and challenging task, especially when training data size is less. To address this phenomena, transfer learning has been used as a popularly adopted technique. While transferring the learned knowledge from one task to another, fine-tuning with the target-dependent Fully Connected (FC) layers generally produces better results over the target task. In this paper, the proposed AutoFCL model attempts to learn the structure of FC layers of a CNN automatically using Bayesian optimization. To evaluate the performance of the proposed AutoFCL, we utilize five pre-trained CNN models such as VGG-16, ResNet, DenseNet, MobileNet, and NASNetMobile. The experiments are conducted on three benchmark datasets, namely CalTech-101, Oxford-102 Flowers, and UC Merced Land Use datasets. Fine-tuning the newly learned (target-dependent) FC layers leads to state-of-the-art performance, according to the experiments carried out in this research. The proposed AutoFCL method outperforms the existing methods over CalTech-101 and Oxford-102 Flowers datasets by achieving the accuracy of 94.38% and 98.89%, respectively. However, our method achieves comparable performance on the UC Merced Land Use dataset with 96.83% accuracy. The source codes of this research are available at https://github.com/shabbeersh/AutoFCL.

LGSep 12, 2019Code
diffGrad: An Optimization Method for Convolutional Neural Networks

Shiv Ram Dubey, Soumendu Chakraborty, Swalpa Kumar Roy et al.

Stochastic Gradient Decent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is to change by equal sized steps for all parameters, irrespective of gradient behavior. Hence, an efficient way of deep network optimization is to make adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods such as AdaGrad, AdaDelta, RMSProp and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients. Thus, these methods do not take advantage of local change in gradients. In this paper, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that it should have a larger step size for faster gradient changing parameters and a lower step size for lower gradient changing parameters. The convergence analysis is done using the regret bound approach of online learning framework. Rigorous analysis is made in this paper over three synthetic complex non-convex functions. The image categorization experiments are also conducted over the CIFAR10 and CIFAR100 datasets to observe the performance of diffGrad with respect to the state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. The residual unit (ResNet) based Convolutional Neural Networks (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms other optimizers. Also, we show that diffGrad performs uniformly well for training CNN using different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad.

CVJan 21, 2019Code
Impact of Fully Connected Layers on Performance of Convolutional Neural Networks for Image Classification

S. H. Shabbeer Basha, Shiv Ram Dubey, Viswanath Pulabaigari et al.

The Convolutional Neural Networks (CNNs), in domains like computer vision, mostly reduced the need for handcrafted features due to its ability to learn the problem-specific features from the raw input data. However, the selection of dataset-specific CNN architecture, which mostly performed by either experience or expertise is a time-consuming and error-prone process. To automate the process of learning a CNN architecture, this paper attempts at finding the relationship between Fully Connected (FC) layers with some of the characteristics of the datasets. The CNN architectures, and recently datasets also, are categorized as deep, shallow, wide, etc. This paper tries to formalize these terms along with answering the following questions. (i) What is the impact of deeper/shallow architectures on the performance of the CNN w.r.t. FC layers?, (ii) How the deeper/wider datasets influence the performance of CNN w.r.t. FC layers?, and (iii) Which kind of architecture (deeper/ shallower) is better suitable for which kind of (deeper/ wider) datasets. To address these findings, we have performed experiments with three CNN architectures having different depths. The experiments are conducted by varying the number of FC layers. We used four widely used datasets including CIFAR-10, CIFAR-100, Tiny ImageNet, and CRCHistoPhenotypes to justify our findings in the context of the image classification problem. The source code of this research is available at https://github.com/shabbeersh/Impact-of-FC-layers.

CVSep 17, 2024
Scale-Invariant Object Detection by Adaptive Convolution with Unified Global-Local Context

Amrita Singh, Snehasis Mukherjee

Dense features are important for detecting minute objects in images. Unfortunately, despite the remarkable efficacy of the CNN models in multi-scale object detection, CNN models often fail to detect smaller objects in images due to the loss of dense features during the pooling process. Atrous convolution addresses this issue by applying sparse kernels. However, sparse kernels often can lose the multi-scale detection efficacy of the CNN model. In this paper, we propose an object detection model using a Switchable (adaptive) Atrous Convolutional Network (SAC-Net) based on the efficientDet model. A fixed atrous rate limits the performance of the CNN models in the convolutional layers. To overcome this limitation, we introduce a switchable mechanism that allows for dynamically adjusting the atrous rate during the forward pass. The proposed SAC-Net encapsulates the benefits of both low-level and high-level features to achieve improved performance on multi-scale object detection tasks, without losing the dense features. Further, we apply a depth-wise switchable atrous rate to the proposed network, to improve the scale-invariant features. Finally, we apply global context on the proposed model. Our extensive experiments on benchmark datasets demonstrate that the proposed SAC-Net outperforms the state-of-the-art models by a significant margin in terms of accuracy.

CVFeb 12, 2025
Plantation Monitoring Using Drone Images: A Dataset and Performance Review

Yashwanth Karumanchi, Gudala Laxmi Prasanna, Snehasis Mukherjee et al.

Automatic monitoring of tree plantations plays a crucial role in agriculture. Flawless monitoring of tree health helps farmers make informed decisions regarding their management by taking appropriate action. Use of drone images for automatic plantation monitoring can enhance the accuracy of the monitoring process, while still being affordable to small farmers in developing countries such as India. Small, low cost drones equipped with an RGB camera can capture high-resolution images of agricultural fields, allowing for detailed analysis of the well-being of the plantations. Existing methods of automated plantation monitoring are mostly based on satellite images, which are difficult to get for the farmers. We propose an automated system for plantation health monitoring using drone images, which are becoming easier to get for the farmers. We propose a dataset of images of trees with three categories: ``Good health", ``Stunted", and ``Dead". We annotate the dataset using CVAT annotation tool, for use in research purposes. We experiment with different well-known CNN models to observe their performance on the proposed dataset. The initial low accuracy levels show the complexity of the proposed dataset. Further, our study revealed that, depth-wise convolution operation embedded in a deep CNN model, can enhance the performance of the model on drone dataset. Further, we apply state-of-the-art object detection models to identify individual trees to better monitor them automatically.

CVSep 12, 2025
SCoDA: Self-supervised Continual Domain Adaptation

Chirayu Agrawal, Snehasis Mukherjee

Source-Free Domain Adaptation (SFDA) addresses the challenge of adapting a model to a target domain without access to the data of the source domain. Prevailing methods typically start with a source model pre-trained with full supervision and distill the knowledge by aligning instance-level features. However, these approaches, relying on cosine similarity over L2-normalized feature vectors, inadvertently discard crucial geometric information about the latent manifold of the source model. We introduce Self-supervised Continual Domain Adaptation (SCoDA) to address these limitations. We make two key departures from standard practice: first, we avoid the reliance on supervised pre-training by initializing the proposed framework with a teacher model pre-trained entirely via self-supervision (SSL). Second, we adapt the principle of geometric manifold alignment to the SFDA setting. The student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss. To combat catastrophic forgetting, the teacher's parameters are updated via an Exponential Moving Average (EMA) of the student's parameters. Extensive experiments on benchmark datasets demonstrate that SCoDA significantly outperforms state-of-the-art SFDA methods.

CVJul 25, 2025
Continual Learning-Based Unified Model for Unpaired Image Restoration Tasks

Kotha Kartheek, Lingamaneni Gnanesh Chowdary, Snehasis Mukherjee

Restoration of images contaminated by different adverse weather conditions such as fog, snow, and rain is a challenging task due to the varying nature of the weather conditions. Most of the existing methods focus on any one particular weather conditions. However, for applications such as autonomous driving, a unified model is necessary to perform restoration of corrupted images due to different weather conditions. We propose a continual learning approach to propose a unified framework for image restoration. The proposed framework integrates three key innovations: (1) Selective Kernel Fusion layers that dynamically combine global and local features for robust adaptive feature selection; (2) Elastic Weight Consolidation (EWC) to enable continual learning and mitigate catastrophic forgetting across multiple restoration tasks; and (3) a novel Cycle-Contrastive Loss that enhances feature discrimination while preserving semantic consistency during domain translation. Further, we propose an unpaired image restoration approach to reduce the dependance of the proposed approach on the training data. Extensive experiments on standard benchmark datasets for dehazing, desnowing and deraining tasks demonstrate significant improvements in PSNR, SSIM, and perceptual quality over the state-of-the-art.

CVJan 30, 2021
Deep Model Compression based on the Training History

S. H. Shabbeer Basha, Mohammad Farazuddin, Viswanath Pulabaigari et al.

Deep Convolutional Neural Networks (DCNNs) have shown promising performances in several visual recognition problems which motivated the researchers to propose popular architectures such as LeNet, AlexNet, VGGNet, ResNet, and many more. These architectures come at a cost of high computational complexity and parameter storage. To get rid of storage and computational complexity, deep model compression methods have been evolved. We propose a "History Based Filter Pruning (HBFP)" method that utilizes network training history for filter pruning. Specifically, we prune the redundant filters by observing similar patterns in the filter's L1-norms (absolute sum of weights) over the training epochs. We iteratively prune the redundant filters of a CNN in three steps. First, we train the model and select the filter pairs with redundant filters in each pair. Next, we optimize the network to ensure an increased measure of similarity between the filters in a pair. This optimization of the network facilitates us to prune one filter from each pair based on its importance without much information loss. Finally, we retrain the network to regain the performance, which is dropped due to filter pruning. We test our approach on popular architectures such as LeNet-5 on MNIST dataset; VGG-16, ResNet-56, and ResNet-110 on CIFAR-10 dataset, and ResNet-50 on ImageNet. The proposed pruning method outperforms the state-of-the-art in terms of FLOPs reduction (floating-point operations) by 97.98%, 83.42%, 78.43%, 74.95%, and 75.45% for LeNet-5, VGG-16, ResNet-56, ResNet-110, and ResNet-50, respectively, while maintaining the less error rate.

CVDec 7, 2020
MERANet: Facial Micro-Expression Recognition using 3D Residual Attention Network

Viswanatha Reddy Gajjala, Sai Prasanna Teja Reddy, Snehasis Mukherjee et al.

Micro-expression has emerged as a promising modality in affective computing due to its high objectivity in emotion detection. Despite the higher recognition accuracy provided by the deep learning models, there are still significant scope for improvements in micro-expression recognition techniques. The presence of micro-expressions in small-local regions of the face, as well as the limited size of available databases, continue to limit the accuracy in recognizing micro-expressions. In this work, we propose a facial micro-expression recognition model using 3D residual attention network named MERANet to tackle such challenges. The proposed model takes advantage of spatial-temporal attention and channel attention together, to learn deeper fine-grained subtle features for classification of emotions. Further, the proposed model encompasses both spatial and temporal information simultaneously using the 3D kernels and residual connections. Moreover, the channel features and spatio-temporal features are re-calibrated using the channel and spatio-temporal attentions, respectively in each residual module. Our attention mechanism enables the model to learn to focus on different facial areas of interest. The experiments are conducted on benchmark facial micro-expression datasets. A superior performance is observed as compared to the state-of-the-art for facial micro-expression recognition on benchmark data.

CVFeb 6, 2020
An Information-rich Sampling Technique over Spatio-Temporal CNN for Classification of Human Actions in Videos

S. H. Shabbeer Basha, Viswanath Pulabaigari, Snehasis Mukherjee

We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier. Traditionally in deep learning based human activity recognition approaches, either a few random frames or every $k^{th}$ frame of the video is considered for training the 3D CNN, where $k$ is a small positive integer, like 4, 5, or 6. This kind of sampling reduces the volume of the input data, which speeds-up training of the network and also avoids over-fitting to some extent, thus enhancing the performance of the 3D CNN model. In the proposed video sampling technique, consecutive $k$ frames of a video are aggregated into a single frame by computing a Gaussian-weighted summation of the $k$ frames. The resulting frame (aggregated frame) preserves the information in a better way than the conventional approaches and experimentally shown to perform better. In this paper, a 3D CNN architecture is proposed to extract the spatio-temporal features and follows Long Short-Term Memory (LSTM) to recognize human actions. The proposed 3D CNN architecture is capable of handling the videos where the camera is placed at a distance from the performer. Experiments are performed with KTH and WEIZMANN human actions datasets, whereby it is shown to produce comparable results with the state-of-the-art techniques.

CVAug 27, 2019
Visual Question Answering using Deep Learning: A Survey and Performance Analysis

Yash Srivastava, Vaishnav Murali, Shiv Ram Dubey et al.

The Visual Question Answering (VQA) task combines challenges for processing data with both Visual and Linguistic processing, to answer basic `common sense' questions about given images. Given an image and a question in natural language, the VQA system tries to find the correct answer to it using visual elements of the image and inference gathered from textual questions. In this survey, we cover and discuss the recent datasets released in the VQA domain dealing with various types of question-formats and robustness of the machine-learning models. Next, we discuss about new deep learning models that have shown promising results over the VQA datasets. At the end, we present and discuss some of the results computed by us over the vanilla VQA model, Stacked Attention Network and the VQA Challenge 2017 winner model. We also provide the detailed analysis along with the challenges and future research directions.

CVApr 25, 2019
A Conditional Adversarial Network for Scene Flow Estimation

Ravi Kumar Thakur, Snehasis Mukherjee

The problem of Scene flow estimation in depth videos has been attracting attention of researchers of robot vision, due to its potential application in various areas of robotics. The conventional scene flow methods are difficult to use in reallife applications due to their long computational overhead. We propose a conditional adversarial network SceneFlowGAN for scene flow estimation. The proposed SceneFlowGAN uses loss function at two ends: both generator and descriptor ends. The proposed network is the first attempt to estimate scene flow using generative adversarial networks, and is able to estimate both the optical flow and disparity from the input stereo images simultaneously. The proposed method is experimented on a large RGB-D benchmark sceneflow dataset.

CVMar 27, 2019
Spontaneous Facial Micro-Expression Recognition using 3D Spatiotemporal Convolutional Neural Networks

Sai Prasanna Teja Reddy, Surya Teja Karri, Shiv Ram Dubey et al.

Facial expression recognition in videos is an active area of research in computer vision. However, fake facial expressions are difficult to be recognized even by humans. On the other hand, facial micro-expressions generally represent the actual emotion of a person, as it is a spontaneous reaction expressed through human face. Despite of a few attempts made for recognizing micro-expressions, still the problem is far from being a solved problem, which is depicted by the poor rate of accuracy shown by the state-of-the-art methods. A few CNN based approaches are found in the literature to recognize micro-facial expressions from still images. Whereas, a spontaneous micro-expression video contains multiple frames that have to be processed together to encode both spatial and temporal information. This paper proposes two 3D-CNN methods: MicroExpSTCNN and MicroExpFuseNet, for spontaneous facial micro-expression recognition by exploiting the spatiotemporal information in CNN framework. The MicroExpSTCNN considers the full spatial information, whereas the MicroExpFuseNet is based on the 3D-CNN feature fusion of the eyes and mouth regions. The experiments are performed over CAS(ME)^2 and SMIC micro-expression databases. The proposed MicroExpSTCNN model outperforms the state-of-the-art methods.

CVSep 30, 2018
RCCNet: An Efficient Convolutional Neural Network for Histological Routine Colon Cancer Nuclei Classification

S H Shabbeer Basha, Soumen Ghosh, Kancharagunta Kishan Babu et al.

Efficient and precise classification of histological cell nuclei is of utmost importance due to its potential applications in the field of medical image analysis. It would facilitate the medical practitioners to better understand and explore various factors for cancer treatment. The classification of histological cell nuclei is a challenging task due to the cellular heterogeneity. This paper proposes an efficient Convolutional Neural Network (CNN) based architecture for classification of histological routine colon cancer nuclei named as RCCNet. The main objective of this network is to keep the CNN model as simple as possible. The proposed RCCNet model consists of only 1,512,868 learnable parameters which are significantly less compared to the popular CNN models such as AlexNet, CIFARVGG, GoogLeNet, and WRN. The experiments are conducted over publicly available routine colon cancer histological dataset "CRCHistoPhenotypes". The results of the proposed RCCNet model are compared with five state-of-the-art CNN models in terms of the accuracy, weighted average F1 score and training time. The proposed method has achieved a classification accuracy of 80.61% and 0.7887 weighted average F1 score. The proposed RCCNet is more efficient and generalized terms of the training time and data over-fitting, respectively.

CVSep 30, 2018
A Multi-Face Challenging Dataset for Robust Face Recognition

Shiv Ram Dubey, Snehasis Mukherjee

Face recognition in images is an active area of interest among the computer vision researchers. However, recognizing human face in an unconstrained environment, is a relatively less-explored area of research. Multiple face recognition in unconstrained environment is a challenging task, due to the variation of view-point, scale, pose, illumination and expression of the face images. Partial occlusion of faces makes the recognition task even more challenging. The contribution of this paper is two-folds: introducing a challenging multiface dataset (i.e., IIITS MFace Dataset) for face recognition in unconstrained environment and evaluating the performance of state-of-the-art hand-designed and deep learning based face descriptors on the dataset. The proposed IIITS MFace dataset contains faces with challenges like pose variation, occlusion, mask, spectacle, expressions, change of illumination, etc. We experiment with several state-of-the-art face descriptors, including recent deep learning based face descriptors like VGGFace, and compare with the existing benchmark face datasets. Results of the experiments clearly show that the difficulty level of the proposed dataset is much higher compared to the benchmark datasets.

CVAug 26, 2018
Single Image Dehazing Based on Generic Regularity

Kushal Borkar, Snehasis Mukherjee

This paper proposes a novel technique for single image dehazing. Most of the state-of-the-art methods for single image dehazing relies either on Dark Channel Prior (DCP) or on Color line. The proposed method combines the two different approaches. We initially compute the dark channel prior and then apply a Nearest-Neighbor (NN) based regularization technique to obtain a smooth transmission map of the hazy image. We consider the effect of airlight on the image by using the color line model to assess the commitment of airlight in each patch of the image and interpolate at the local neighborhood where the estimate is unreliable. The NN based regularization of the DCP can remove the haze, whereas, the color line based interpolation of airlight effect makes the proposed system robust against the variation of haze within an image due to multiple light sources. The proposed method is tested on benchmark datasets and shows promising results compared to the state-of-the-art, including the deep learning based methods, which require a huge computational setup. Moreover, the proposed method outperforms the recent deep learning based methods when applied on images with sky regions.

CVJul 10, 2018
SceneEDNet: A Deep Learning Approach for Scene Flow Estimation

Ravi Kumar Thakur, Snehasis Mukherjee

Estimating scene flow in RGB-D videos is attracting much interest of the computer vision researchers, due to its potential applications in robotics. The state-of-the-art techniques for scene flow estimation, typically rely on the knowledge of scene structure of the frame and the correspondence between frames. However, with the increasing amount of RGB-D data captured from sophisticated sensors like Microsoft Kinect, and the recent advances in the area of sophisticated deep learning techniques, introduction of an efficient deep learning technique for scene flow estimation, is becoming important. This paper introduces a first effort to apply a deep learning method for direct estimation of scene flow by presenting a fully convolutional neural network with an encoder-decoder (ED) architecture. The proposed network SceneEDNet involves estimation of three dimensional motion vectors of all the scene points from sequence of stereo images. The training for direct estimation of scene flow is done using consecutive pairs of stereo images and corresponding scene flow ground truth. The proposed architecture is applied on a huge dataset and provides meaningful results.

CVJul 9, 2018
Human Activity Recognition in RGB-D Videos by Dynamic Images

Snehasis Mukherjee, Leburu Anvitha, T. Mohana Lahari

Human Activity Recognition in RGB-D videos has been an active research topic during the last decade. However, no efforts have been found in the literature, for recognizing human activity in RGB-D videos where several performers are performing simultaneously. In this paper we introduce such a challenging dataset with several performers performing the activities. We present a novel method for recognizing human activities in such videos. The proposed method aims in capturing the motion information of the whole video by producing a dynamic image corresponding to the input video. We use two parallel ResNext-101 to produce the dynamic images for the RGB video and depth video separately. The dynamic images contain only the motion information and hence, the unnecessary background information are eliminated. We send the two dynamic images extracted from the RGB and Depth videos respectively, through a fully connected layer of neural networks. The proposed dynamic image reduces the complexity of the recognition process by extracting a sparse matrix from a video. However, the proposed system maintains the required motion information for recognizing the activity. The proposed method has been tested on the MSR Action 3D dataset and has shown comparable performances with respect to the state-of-the-art. We also apply the proposed method on our own dataset, where the proposed method outperforms the state-of-the-art approaches.

CVApr 17, 2018
Automatic Assessment of Artistic Quality of Photos

Ashish Verma, Kranthi Koukuntla, Rohit Varma et al.

This paper proposes a technique to assess the aesthetic quality of photographs. The goal of the study is to predict whether a given photograph is captured by professional photographers, or by common people, based on a measurement of artistic quality of the photograph. We propose a Multi-Layer-Perceptron based system to analyze some low, mid and high level image features and find their effectiveness to measure artistic quality of the image and produce a measurement of the artistic quality of the image on a scale of 10. We validate the proposed system on a large dataset, containing images downloaded from the internet. The dataset contains some images captured by professional photographers and the rest of the images captured by common people. The proposed measurement of artistic quality of images provides higher value of photo quality for the images captured by professional photographers, compared to the values provided for the other images.

CVFeb 28, 2018
LDOP: Local Directional Order Pattern for Robust Face Retrieval

Shiv Ram Dubey, Snehasis Mukherjee

The local descriptors have gained wide range of attention due to their enhanced discriminative abilities. It has been proved that the consideration of multi-scale local neighborhood improves the performance of the descriptor, though at the cost of increased dimension. This paper proposes a novel method to construct a local descriptor using multi-scale neighborhood by finding the local directional order among the intensity values at different scales in a particular direction. Local directional order is the multi-radius relationship factor in a particular direction. The proposed local directional order pattern (LDOP) for a particular pixel is computed by finding the relationship between the center pixel and local directional order indexes. It is required to transform the center value into the range of neighboring orders. Finally, the histogram of LDOP is computed over whole image to construct the descriptor. In contrast to the state-of-the-art descriptors, the dimension of the proposed descriptor does not depend upon the number of neighbors involved to compute the order; it only depends upon the number of directions. The introduced descriptor is evaluated over the image retrieval framework and compared with the state-of-the-art descriptors over challenging face databases such as PaSC, LFW, PubFig, FERET, AR, AT&T, and ExtendedYale. The experimental results confirm the superiority and robustness of the LDOP descriptor.