Vinay Kumar

CV
h-index: 6
17 papers
47 citations
Novelty: 40%
AI Score: 47

17 Papers

CV · Nov 20, 2022
Audio-visual video face hallucination with frequency supervision and cross modality support by speech based lip reading loss

Shailza Sharma, Abhinav Dhall, Vinay Kumar et al.

Recently, there have been numerous breakthroughs in face hallucination tasks. However, the task remains more challenging in videos than in images because of inherent consistency issues. The extra temporal dimension in video face hallucination makes it non-trivial to learn the facial motion throughout the sequence. To learn these fine spatio-temporal motion details, we propose a novel cross-modal audio-visual Video Face Hallucination Generative Adversarial Network (VFH-GAN). The architecture exploits the semantic correlation between the movement of the facial structure and the associated speech signal. Another major issue in present video-based approaches is blurriness around key facial regions such as the mouth and lips, where spatial displacement is much higher than in other areas. The proposed approach explicitly defines a lip reading loss to learn the fine-grained motion in these facial areas. During training, GANs tend to fit frequencies from low to high, which leads them to miss the hard-to-synthesize frequencies. Therefore, to add salient frequency features to the network, we add a frequency-based loss function. Visual and quantitative comparisons with the state of the art show a significant improvement in performance and efficacy.

CV · Jul 9, 2022
Variational Approach for Intensity Domain Multi-exposure Image Fusion

Harbinder Singh, Dinesh Arora, Vinay Kumar

Recent innovations show that blending details captured by a single Low Dynamic Range (LDR) sensor overcomes the limitations of standard digital cameras in capturing details from a high dynamic range scene. We present a method to produce a well-exposed fused image that can be displayed directly on conventional display devices. The aim is to preserve details in poorly illuminated and brightly illuminated regions. The proposed approach does not require true radiance reconstruction or tone manipulation steps. This objective is achieved by taking into account a local information measure that selects well-exposed regions across input exposures. In addition, Contrast Limited Adaptive Histogram Equalization (CLAHE) is introduced to improve the uniformity of the input multi-exposure images prior to fusion.

AI · May 9
Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery

Harshit Bisht, Vinay Kumar, Kevin Maik Jablonka et al.

A growing body of work pursues AI scientists capable of end-to-end autonomous scientific discovery. This position paper argues that although they already function as co-scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post-training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single-turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI-generated hypotheses, and application driven by scientific need rather than tool affordance.

AI · May 9
MDGYM: Benchmarking AI Agents on Molecular Simulations

Vinay Kumar, Satyendra Rajput, Mausam et al.

The promise of AI-driven scientific discovery hinges on whether AI agents can autonomously design and execute the computational workflows that underpin modern science. Molecular dynamics (MD) simulation presents a natural test bed to stress-test this claim; it requires translating physical intuition into syntactically and semantically correct input scripts, reasoning about initial and boundary conditions, diagnosing numerically unstable trajectories, and interpreting outputs against known physical behavior and laws. We introduce MDGYM, a benchmark of 169 expert-curated MD simulations spanning LAMMPS and GROMACS, two widely used MD packages, across three increasing difficulty levels. We evaluate three agentic frameworks (Claude Code, Codex, and OpenHands) with four LLMs, and find that all perform poorly: even the strongest agent solves only 21% of easy-level tasks, with less than 10% at higher difficulties. Trajectory analysis reveals a characteristic pattern of failure: agents successfully invoke simulation machinery but produce physically unstable configurations, fabricate numerical outputs without executing the underlying computation, or abandon tasks prematurely rather than iterating through simulation-specific errors. These failure modes are qualitatively distinct from those observed in general software engineering benchmarks, indicating that fluent code generation does not transfer to grounded physical reasoning.

AI · May 8
PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams

Pranavkumar Mallela, Vinay Kumar, Shashi Shekhar Jha et al.

Human-AI teams play a pivotal role in improving overall system performance when neither the human nor the model can achieve such performance on their own. With the advent of powerful and accessible Generative AI models, several mundane tasks have morphed into Human-AI team tasks. From writing essays to developing advanced algorithms, humans have found that AI assistance accelerates their work pace like never before. In classification tasks, where the final output is a single hard label, it is crucial to address how human and model outputs are combined. Prior work elegantly solves this problem with Bayes' rule, under the assumption that human and model outputs are conditionally independent given the ground truth. Specifically, it describes a method to combine a single deterministic labeler (the human) and a probabilistic labeler (the classifier model) using the model's instance-level and the human's class-level calibrated probabilities.
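
The Bayes' rule combination described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the human's class-level behaviour is summarised by a confusion matrix `confusion[y][h]` (an estimate of P(human says h | true class y)) and that `model_probs` are already calibrated. The function name and the numbers are hypothetical.

```python
def combine_human_and_model(model_probs, human_label, confusion):
    """Posterior over classes given a human hard label and model probabilities.

    Under conditional independence given the true class y:
        P(y | human, model) is proportional to P(human | y) * P(y | model)

    model_probs[y]  : model's calibrated probability of class y (instance level)
    human_label     : class index chosen by the human
    confusion[y][h] : estimated P(human says h | true class y) (class level)
    """
    unnormalized = [
        confusion[y][human_label] * model_probs[y]
        for y in range(len(model_probs))
    ]
    total = sum(unnormalized)
    if total == 0:
        # The human label is impossible under the confusion matrix;
        # fall back to the model alone (an assumption of this sketch).
        return list(model_probs)
    return [u / total for u in unnormalized]


# Hypothetical three-class example: the model favours class 0,
# the human says class 1 and is fairly reliable on that class.
model_probs = [0.6, 0.3, 0.1]
confusion = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
posterior = combine_human_and_model(model_probs, 1, confusion)
predicted = max(range(len(posterior)), key=posterior.__getitem__)
```

In this example the combined prediction follows the human, because the human's label is much more likely under class 1 than under class 0, which outweighs the model's preference.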

SP · Aug 31, 2025
PyNoetic: A modular python framework for no-code development of EEG brain-computer interfaces

Gursimran Singh, Aviral Chharia, Rahul Upadhyay et al.

Electroencephalography (EEG)-based Brain-Computer Interfaces (BCIs) have emerged as a transformative technology with applications spanning robotics, virtual reality, medicine, and rehabilitation. However, existing BCI frameworks face several limitations, including a lack of stage-wise flexibility essential for experimental research, steep learning curves for researchers without programming expertise, elevated costs due to reliance on proprietary software, and a lack of all-inclusive features leading to the use of multiple external tools affecting research outcomes. To address these challenges, we present PyNoetic, a modular BCI framework designed to cater to the diverse needs of BCI research. PyNoetic is one of the very few frameworks in Python that encompasses the entire BCI design pipeline, from stimulus presentation and data acquisition to channel selection, filtering, feature extraction, artifact removal, and finally simulation and visualization. Notably, PyNoetic introduces an intuitive and end-to-end GUI coupled with a unique pick-and-place configurable flowchart for no-code BCI design, making it accessible to researchers with minimal programming experience. For advanced users, it facilitates the seamless integration of custom functionalities and novel algorithms with minimal coding, ensuring adaptability at each design stage. PyNoetic also includes a rich array of analytical tools such as machine learning models, brain-connectivity indices, systematic testing functionalities via simulation, and evaluation methods of novel paradigms. PyNoetic's strengths lie in its versatility for both offline and real-time BCI development, which streamlines the design process, allowing researchers to focus on more intricate aspects of BCI development and thus accelerate their research endeavors. Project Website: https://neurodiag.github.io/PyNoetic

CV · May 26, 2023
Pruning Distorted Images in MNIST Handwritten Digits

Amarnath R, Vinay Kumar

Recognizing handwritten digits is a challenging task primarily due to the diversity of writing styles and the presence of noisy images. The widely used MNIST dataset, which is commonly employed as a benchmark for this task, includes distorted digits with irregular shapes, incomplete strokes, and varying skew in both the training and testing datasets. Consequently, these factors contribute to reduced accuracy in digit recognition. To overcome this challenge, we propose a two-stage deep learning approach. In the first stage, we create a simple neural network to identify distorted digits within the training set. This model serves to detect and filter out such distorted and ambiguous images. In the second stage, we exclude these identified images from the training dataset and proceed to retrain the model using the filtered dataset. This process aims to improve the classification accuracy and confidence levels while mitigating issues of underfitting and overfitting. Our experimental results demonstrate the effectiveness of the proposed approach, achieving an accuracy rate of over 99.5% on the testing dataset. This significant improvement showcases the potential of our method in enhancing digit classification accuracy. In our future work, we intend to explore the scalability of this approach and investigate techniques to further enhance accuracy by reducing the size of the training data.

AI · Nov 16, 2021
From Convolutions towards Spikes: The Environmental Metric that the Community currently Misses

Aviral Chharia, Shivu Chauhan, Rahul Upadhyay et al.

Today, the AI community is obsessed with 'state-of-the-art' scores (80% of papers in NeurIPS) as the major performance metric, due to which an important parameter, i.e., the environmental metric, remains unreported. Computational capabilities were a limiting factor a decade ago; in the foreseeable future, however, the challenge will be to develop environment-friendly and power-efficient algorithms. The human brain, which has been optimizing itself for almost a million years, consumes the same amount of power as a typical laptop. Therefore, developing nature-inspired algorithms is one solution. In this study, we show that currently used ANNs are not what we find in nature, and why spiking neural networks, which mirror the mammalian visual cortex, have attracted much interest despite their lower performance. We further highlight the hardware gaps that restrict researchers from using spike-based computation to develop neuromorphic energy-efficient microchips on a large scale. Using neuromorphic processors instead of traditional GPUs might be more environment-friendly and efficient. These processors will turn SNNs into an ideal solution for the problem. This paper examines in depth the current gaps and the lack of comparative research, while proposing new research directions at the intersection of two fields: neuroscience and deep learning. Further, we define a new evaluation metric, 'NATURE', for reporting the carbon footprint of AI models.

LG · Nov 10, 2021
Entropy optimized semi-supervised decomposed vector-quantized variational autoencoder model based on transfer learning for multiclass text classification and generation

Shivani Malhotra, Vinay Kumar, Alpana Agarwal

Semisupervised text classification has become a major focus of research over the past few years. Hitherto, most of the research has been based on supervised learning, but its main drawback is the unavailability of labeled data samples in practical applications. Training deep generative models and learning comprehensive representations without supervision remains a key challenge. Even though continuous latent variables are employed primarily in deep latent variable models, discrete latent variables, with their enhanced understandability and better compressed representations, are effectively used by researchers. In this paper, we propose a semisupervised discrete latent variable model for multi-class text classification and text generation. The proposed model employs transfer learning to train a quantized transformer model, which is able to learn competently using fewer labeled instances. The model applies a decomposed vector quantization technique to overcome problems like posterior collapse and index collapse. Shannon entropy is used for the decomposed sub-encoders, on which a variable DropConnect is applied, to retain maximum information. Moreover, gradients of the loss function are adaptively modified during backpropagation from decoder to encoder to enhance the performance of the model. Three conventional datasets of diverse range have been used to validate the proposed model on a variable number of labeled instances. Experimental results indicate that the proposed model remarkably surpasses the state-of-the-art models.

CV · Oct 5, 2021
Frequency Aware Face Hallucination Generative Adversarial Network with Semantic Structural Constraint

Shailza Sharma, Abhinav Dhall, Vinay Kumar

In this paper, we address the issue of face hallucination. Most current face hallucination methods rely on two-dimensional facial priors to generate high-resolution face images from low-resolution face images. These methods are only capable of assimilating global information into the generated image. Some inherent problems remain in these methods, such as missing local features, subtle structural details, and depth information in the final output image. The present work proposes a novel Generative Adversarial Network (GAN) based progressive Face Hallucination (FH) network to address these issues. The generator of the proposed model comprises the FH network and two sub-networks that assist the FH network in generating high-resolution images. The first sub-network explicitly adds high-frequency components to the model. To explicitly encode the high-frequency components, an autoencoder is proposed to generate high-resolution coefficients of the Discrete Cosine Transform (DCT). The second sub-network is proposed to add three-dimensional parametric information to the network. This network uses a shape model of 3D Morphable Models (3DMM) to add a structural constraint to the FH network. Extensive experimental results in the paper show that the proposed model outperforms the state-of-the-art methods.

CV · Sep 28, 2020
A complete character recognition and transliteration technique for Devanagari script

Jasmine Kaur, Vinay Kumar

Transliteration involves the transformation of one script into another based on phonetic similarities between the characters of two distinct scripts. In this paper, we present a novel technique for automatic transliteration of Devanagari script using character recognition. One of the first tasks performed to isolate the constituent characters is segmentation. The line segmentation methodology in this manuscript handles the case of overlapping lines. The character segmentation algorithm is designed to segment conjuncts and separate shadow characters. The presented shadow character segmentation scheme employs the connected component method to isolate the character, keeping the constituent characters intact. Statistical features, namely moments of different orders such as area, variance, skewness, and kurtosis, along with structural features of characters, are employed in a two-phase recognition process. After recognition, the constituent Devanagari characters are mapped to the corresponding Roman letters in such a way that the resulting Roman letters have a similar pronunciation to the source characters.
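
The statistical features named in this abstract (area, variance, skewness, kurtosis) can be illustrated with a small sketch. The abstract does not say over which quantity the moments are taken; the code below assumes they are computed over the horizontal coordinates of the foreground pixels of a binarised character, which is only one plausible reading. The function name is hypothetical.

```python
def shape_moments(binary_image):
    """Return (area, variance, skewness, kurtosis) of a binary character image.

    Assumption of this sketch: the moments are taken over the column index
    of every foreground (non-zero) pixel.
    """
    xs = [x for row in binary_image for x, value in enumerate(row) if value]
    area = len(xs)  # zeroth-order moment: number of foreground pixels
    if area == 0:
        raise ValueError("image contains no foreground pixels")

    mean = sum(xs) / area
    variance = sum((x - mean) ** 2 for x in xs) / area
    if variance == 0:
        # All pixels share one column; higher standardized moments are undefined.
        return area, 0.0, 0.0, 0.0

    third = sum((x - mean) ** 3 for x in xs) / area
    fourth = sum((x - mean) ** 4 for x in xs) / area
    skewness = third / variance ** 1.5
    kurtosis = fourth / variance ** 2
    return area, variance, skewness, kurtosis


# A small plus-shaped glyph, symmetric about its centre column.
glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
area, variance, skewness, kurtosis = shape_moments(glyph)
```

For a glyph that is symmetric about its centre column, the skewness is zero; asymmetric strokes shift it away from zero, which is what makes such moments usable as coarse shape descriptors.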

CV · Aug 26, 2019
Appearance invariant Entry-Exit matching using visual soft biometric traits

Vinay Kumar, P Nagabhushan

The problem of appearance-invariant subject recognition for Entry-Exit surveillance applications is addressed. This paper proposes a novel Semantic Entry-Exit matching model that uses ancillary information about subjects, such as height, build, complexion, and clothing color, to endorse the exit of every subject who had entered a private area. The proposed method is robust to variations in clothing. Each describing attribute is given equal weight while computing the matching score, and hence the proposed model achieves high rank-k accuracy on benchmark datasets. Although the soft biometric traits used in combination cannot achieve high rank-1 accuracy, they help narrow down the search for matching with reliable biometric traits such as gait and face, whose learning and matching time is costlier than that of the visual soft biometrics.
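
The equal-weight matching score mentioned in this abstract can be sketched as below. This is an illustration under assumptions, not the paper's model: each attribute is treated as a categorical label, an attribute contributes 1 when the labels agree and 0 otherwise, and the record layout and names are hypothetical.

```python
ATTRIBUTES = ("height", "build", "complexion", "clothing_color")


def matching_score(entry_subject, exit_subject):
    """Fraction of soft biometric attributes that agree (equal weights)."""
    matches = sum(
        1 for name in ATTRIBUTES if entry_subject[name] == exit_subject[name]
    )
    return matches / len(ATTRIBUTES)


def rank_candidates(exit_subject, entered_subjects):
    """Sort entered subjects from best to worst match for one exit observation."""
    return sorted(
        entered_subjects,
        key=lambda subject: matching_score(subject, exit_subject),
        reverse=True,
    )


# Hypothetical gallery of subjects seen entering.
entered = [
    {"id": "A", "height": "tall", "build": "slim", "complexion": "fair", "clothing_color": "red"},
    {"id": "B", "height": "short", "build": "heavy", "complexion": "dark", "clothing_color": "blue"},
    {"id": "C", "height": "tall", "build": "slim", "complexion": "fair", "clothing_color": "blue"},
]
# Exit observation after a change of clothing.
leaving = {"height": "tall", "build": "slim", "complexion": "fair", "clothing_color": "green"}

ranked_ids = [subject["id"] for subject in rank_candidates(leaving, entered)]
shortlist = ranked_ids[:2]
```

Subjects A and C tie on three of four attributes, so the score alone cannot settle rank 1, but the top-2 shortlist already excludes B. This mirrors the abstract's point that soft biometrics narrow the search before costlier traits such as gait or face are used.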

CV · Aug 2, 2019
Monitoring of people entering and exiting private areas using Computer Vision

Vinay Kumar, P Nagabhushan

Entry-Exit surveillance is a novel research problem that addresses security concerns when people attain absolute privacy in camera-forbidden areas such as toilets and changing rooms, which are basic amenities in public places such as shopping malls, airports, and bus and rail stations. The objective is to monitor individuals from outside these camera-forbidden areas, since they cannot be monitored inside, in order to analyze the time they spend inside and any suspicious transformations in their appearance. In this paper, firstly, we present a pseudo-annotated dataset of a laboratory observation of people entering and exiting the camera-forbidden area, captured using two cameras, in contrast to the state-of-the-art single-camera EnEx dataset. By convention, the proposed dataset is named EnEx2. Next, a spatial transition based event detection method to determine the entry or exit of individuals is presented, with standard results obtained by evaluating the proposed model on the proposed dataset and on publicly available standard video surveillance datasets that are hypothesized as Entry-Exit surveillance scenarios. The proposed dataset is expected to enkindle active research in the Entry-Exit surveillance domain.

CV · Jan 17, 2019
No reference image quality assessment metric based on regional mutual information among images

Vinay Kumar, Vivek Singh Bawa, Rahul Upadhyay

With the inclusion of cameras in daily life, an automatic no-reference image quality evaluation index is required for automatic classification of images. The present manuscript proposes a new No Reference Regional Mutual Information based technique for evaluating the quality of an image. We use regional mutual information on subsets of the complete image. The proposed technique is tested on four benchmark natural image databases and one benchmark synthetic database. A comparative analysis with classical and state-of-the-art methods indicates that the present technique is superior for high-quality images and comparable for the other images of the respective databases.
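
As background for the mutual information measure this abstract builds on, here is a minimal sketch of mutual information between two equally sized image regions, estimated from a joint histogram of quantised grey levels. It is not the paper's quality index: how regions are chosen and how the values are aggregated into a score are not stated in the abstract, and the bin count and function name are assumptions.

```python
import math
from collections import Counter


def mutual_information(region_a, region_b, bins=8, max_value=255):
    """Mutual information (in bits) between two flattened grey-level regions."""
    if len(region_a) != len(region_b) or not region_a:
        raise ValueError("regions must be non-empty and of equal size")

    def quantize(value):
        # Map a grey level in [0, max_value] to one of `bins` histogram bins.
        return min(value * bins // (max_value + 1), bins - 1)

    qa = [quantize(v) for v in region_a]
    qb = [quantize(v) for v in region_b]
    n = len(qa)

    joint = Counter(zip(qa, qb))
    marginal_a = Counter(qa)
    marginal_b = Counter(qb)

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marginal_a[a] / n
        p_b = marginal_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Two toy regions, flattened to lists of grey levels.
patch = [0, 0, 255, 255]
unrelated = [0, 255, 0, 255]

mi_same = mutual_information(patch, patch)
mi_unrelated = mutual_information(patch, unrelated)
```

A region compared with itself yields its own entropy (one bit for two equally likely grey levels), while statistically independent regions yield zero, so the value reflects how much structure two regions share.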

NE · Apr 6, 2017
A Software-equivalent SNN Hardware using RRAM-array for Asynchronous Real-time Learning

Aditya Shukla, Vinay Kumar, Udayan Ganguly

Spiking Neural Networks (SNNs) naturally inspire hardware implementation as they are based on biology. For learning, spike-time-dependent plasticity (STDP) may be implemented using an energy-efficient waveform superposition on a memristor-based synapse. However, system-level implementation has three challenges. First, a classic dilemma is that recognition requires current reading for short voltage spikes, which is disturbed by large voltage waveforms that are simultaneously applied on the same memristor for real-time learning, i.e., the simultaneous read-write dilemma. Second, the hardware needs to exactly replicate the software implementation for easy adaptation of the algorithm to hardware. Third, the devices used in hardware simulations must be realistic. In this paper, we present an approach to address the above concerns. First, learning and recognition occur in separate arrays simultaneously in real time and asynchronously, avoiding non-biomimetic, clocking-based complex signal management. Second, we show that the hardware emulates software at every stage by comparing SPICE (circuit simulator) with MATLAB (mathematical SNN algorithm implementation in software) implementations. As an example, the hardware shows 97.5 percent classification accuracy, which is equivalent to software, for the Fisher-Iris dataset. Third, the STDP is implemented using a model of a synaptic device based on an HfO2 memristor. We show that an increasingly realistic memristor model slightly reduces the hardware performance (85 percent), which highlights the need to engineer RRAM characteristics specifically for SNN.
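
For readers unfamiliar with STDP, the commonly used pair-based exponential form of the rule can be sketched in software as follows. This is the textbook rule, not the paper's memristor waveform scheme or its MATLAB model; the amplitudes and time constants are hypothetical placeholder values.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in milliseconds).

    Pre before post (dt > 0) strengthens the synapse; post before pre
    (dt < 0) weakens it. The effect decays exponentially with |dt|.
    All parameter values are illustrative assumptions.
    """
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


potentiation = stdp_delta_w(t_pre=10.0, t_post=15.0)      # pre leads post
depression = stdp_delta_w(t_pre=15.0, t_post=10.0)        # post leads pre
closer_pair = stdp_delta_w(t_pre=10.0, t_post=12.0)       # smaller gap
```

The sign of the update depends only on spike order and its size on the spike gap, which is the behaviour a hardware synapse has to reproduce to match a software SNN.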

MM · Jul 31, 2013
A simple technique for steganography

Adity Sharma, Anoo Agarwal, Vinay Kumar

A new technique for data hiding in digital images is proposed in this paper. Steganography is a well-known technique for hiding data in an image, but generally the format of the image plays a pivotal role in it, and the scheme is format dependent. In this paper, we discuss a new technique with which, irrespective of the image format, we can easily hide a large amount of data without deteriorating the quality of the image. The data to be hidden is enciphered with the help of a secret key. This enciphered data is then embedded at the end of the image. The enciphered data bits are extracted and then deciphered with the same key used for encryption. Simulation results show that the Image Quality Measures of the proposed scheme are better than those of the conventional LSB replacement technique. The proposed method is simple and easy to implement.
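
The encipher-then-append idea in this abstract can be sketched as below. The abstract does not name the cipher or the embedding layout, so this sketch makes two loud assumptions: a repeating-key XOR stands in for the cipher (it is not secure and is used only for illustration), and a hypothetical 8-byte trailer (`MARKER` plus payload length) lets the extractor find the hidden bytes.

```python
import struct
from itertools import cycle

MARKER = b"STEG"  # hypothetical trailer tag, not taken from the paper


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; applying it twice with the same key restores the data."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def embed(image_bytes: bytes, secret: bytes, key: bytes) -> bytes:
    """Append the enciphered secret after the image data, whatever its format."""
    payload = xor_cipher(secret, key)
    trailer = MARKER + struct.pack(">I", len(payload))
    return image_bytes + payload + trailer


def extract(stego_bytes: bytes, key: bytes) -> bytes:
    """Read the trailer, slice out the enciphered payload, and decipher it."""
    trailer = stego_bytes[-8:]
    if len(trailer) < 8 or trailer[:4] != MARKER:
        raise ValueError("no embedded payload found")
    (length,) = struct.unpack(">I", trailer[4:])
    payload = stego_bytes[len(stego_bytes) - 8 - length:len(stego_bytes) - 8]
    return xor_cipher(payload, key)


# Stand-in for the bytes of an image file in any format.
cover = bytes(range(256))
stego = embed(cover, b"meet at noon", b"k3y")
recovered = extract(stego, b"k3y")
```

Because the original image bytes are left untouched, the displayed pixels do not change, which is consistent with the abstract's claim of no quality loss; the trade-off is that the file grows by the size of the payload.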

MM · Jul 10, 2013
Anisotropic Diffusion for Details Enhancement in Multi-Exposure Image Fusion

Harbinder Singh, Vinay Kumar, Sunil Bhooshan

We develop a multiexposure image fusion method based on texture features, which exploits the edge-preserving and intraregion smoothing property of nonlinear diffusion filters based on partial differential equations (PDE). With the captured multiexposure image series, we first decompose images into base layers and detail layers to extract sharp details and fine details, respectively. The magnitude of the gradient of the image intensity is utilized to encourage smoothness at homogeneous regions in preference to inhomogeneous regions. Then, we consider texture features of the base layer to generate a mask (i.e., decision mask) that guides the fusion of base layers in a multiresolution fashion. Finally, a well-exposed fused image is obtained that combines the fused base layer and the detail layers at each scale across all the input exposures. The proposed algorithm skips the complex High Dynamic Range Image (HDRI) generation and tone mapping steps to produce a detail-preserving image for display on standard dynamic range display devices. Moreover, our technique is effective for blending flash/no-flash image pairs and multifocus images, that is, images focused on different targets.
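
The base/detail decomposition by nonlinear diffusion described in this abstract can be illustrated on a one-dimensional signal. The sketch below uses a Perona-Malik style update with an exponential conduction function, which is a common choice for such filters; the paper's exact scheme, parameters, and 2D multiresolution fusion are not reproduced, and all names and values here are assumptions.

```python
import math


def conduction(diff, kappa):
    """Close to 1 for small gradients (smooth freely), close to 0 at strong edges."""
    return math.exp(-(diff / kappa) ** 2)


def diffuse_1d(signal, iterations=20, kappa=10.0, step=0.2):
    """Edge-preserving smoothing of a 1D signal by nonlinear diffusion."""
    u = [float(v) for v in signal]
    for _ in range(iterations):
        nxt = u[:]
        for i in range(len(u)):
            # Differences to the neighbours; zero flux at the boundaries.
            left = u[i - 1] - u[i] if i > 0 else 0.0
            right = u[i + 1] - u[i] if i < len(u) - 1 else 0.0
            nxt[i] = u[i] + step * (
                conduction(left, kappa) * left
                + conduction(right, kappa) * right
            )
        u = nxt
    return u


# A step edge of height 100 with small texture on both sides.
signal = [0, 2, 0, 2, 0, 100, 102, 100, 102, 100]
base = diffuse_1d(signal)                           # smoothed layer, edge kept
detail = [s - b for s, b in zip(signal, base)]      # removed fine texture
```

The small oscillations are flattened into the base layer while the large step survives, because the conduction term is nearly zero across the edge. Subtracting the base from the input leaves the fine texture as the detail layer.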