Hemant Kumar Kathania

h-index: 13

9 papers

453 citations

Novelty: 37%

AI Score: 38

Ranked #84,511 of 194,257 authors (top 44%); #430 in AS (top 30%)

9 Papers

13.5 · CV · Sep 1, 2024 · Code

ResEmoteNet: Bridging Accuracy and Loss Reduction in Facial Emotion Recognition

Arnab Kumar Roy, Hemant Kumar Kathania, Adhitiya Sharma et al.

The human face is a silent communicator, expressing emotions and thoughts through its expressions. With the advancements in computer vision in recent years, facial emotion recognition technology has made significant strides, enabling machines to decode the intricacies of facial cues. In this work, we propose ResEmoteNet, a novel deep learning architecture for facial emotion recognition that combines Convolutional, Squeeze-Excitation (SE) and Residual Networks. The SE block selectively focuses on the important features of the human face, enhancing the feature representation and suppressing less relevant features. This helps in reducing the loss and enhancing the overall model performance. We also integrate the SE block with three residual blocks that help in learning more complex representations of the data through deeper layers. We evaluated ResEmoteNet on four open-source databases: FER2013, RAF-DB, AffectNet-7 and ExpW, achieving accuracies of 79.79%, 94.76%, 72.39% and 75.67%, respectively. The proposed network outperforms state-of-the-art models across all four databases. The source code for ResEmoteNet is available at https://github.com/ArnabKumarRoy02/ResEmoteNet.
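As a rough illustration of the squeeze-and-excitation idea described in this abstract, the following is a minimal pure-Python sketch, not the ResEmoteNet implementation. The function name `se_block`, the weight matrices `w1` and `w2`, and the tiny feature map are all hypothetical; the real model is available at the repository linked above.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a squeeze-and-excitation step to a feature map shaped [C][H][W].

    w1 is a hypothetical bottleneck weight matrix shaped [R][C];
    w2 is a hypothetical expansion weight matrix shaped [C][R].
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling yields one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expand back to C channels and squash to (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: reweight every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
    return scaled, gates


if __name__ == "__main__":
    # Two 2x2 channels with made-up values and made-up weights.
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
    scaled, gates = se_block(fmap, w1=[[0.5, 0.5]], w2=[[1.0], [-1.0]])
    print(gates)
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate close to 0 are suppressed, which is the "selective focus" the abstract refers to.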

2.0 · CV · Nov 16, 2024 · Code

Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model

Arnab Kumar Roy, Hemant Kumar Kathania, Adhitiya Sharma

Facial Emotion Recognition (FER) plays a crucial role in computer vision, with significant applications in human-computer interaction, affective computing, and areas such as mental health monitoring and personalized learning environments. However, a major challenge in the FER task is the class imbalance commonly found in available datasets, which can hinder both model performance and generalization. In this paper, we tackle the issue of data imbalance by incorporating synthetic data augmentation and leveraging the ResEmoteNet model to enhance the overall performance on the facial emotion recognition task. We employed Stable Diffusion 2 and Stable Diffusion 3 Medium models to generate synthetic facial emotion data, augmenting the training sets of the FER2013 and RAF-DB benchmark datasets. Training ResEmoteNet with these augmented datasets resulted in substantial performance improvements, achieving accuracies of 96.47% on FER2013 and 99.23% on RAF-DB. These results represent absolute improvements of 16.68% on FER2013 and 4.47% on RAF-DB, highlighting the efficacy of synthetic data augmentation in strengthening FER models and underscoring the potential of advanced generative models in FER research and applications. The source code for ResEmoteNet is available at https://github.com/ArnabKumarRoy02/ResEmoteNet.
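The absolute improvements quoted above can be checked with a few lines of arithmetic. This sketch assumes the baselines are the ResEmoteNet accuracies reported in the first paper listed on this page (79.79% on FER2013 and 94.76% on RAF-DB); the variable names are illustrative only.

```python
# Baseline accuracies (%) taken from the ResEmoteNet abstract (assumed baseline).
baseline = {"FER2013": 79.79, "RAF-DB": 94.76}
# Accuracies (%) reported after training with synthetic augmentation.
augmented = {"FER2013": 96.47, "RAF-DB": 99.23}

# Absolute improvement is the difference in percentage points.
gains = {name: round(augmented[name] - baseline[name], 2) for name in baseline}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Under that assumption the differences match the 16.68 and 4.47 percentage-point figures in the abstract.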

3.7 · AS · Jun 20

DSSCNet: A Transfer Learning Framework for Cross-Corpus Dysarthric Speech Severity Classification

Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota et al.

Dysarthric speech severity classification is challenging due to speaker variability, class imbalance, and limited datasets. This study introduces DSSCNet, a deep learning model that employs transfer learning and multi-corpus learning to enhance speaker-independent classification. By pre-training on one dysarthric speech corpus and fine-tuning on another, DSSCNet achieves improved feature extraction and cross-corpus generalization. Experimental results demonstrate that DSSCNet outperforms state-of-the-art models for speaker-independent severity classification, achieving 75.80% accuracy on TORGO and 68.25% on UA-Speech, significantly reducing misclassification errors. The findings confirm that leveraging knowledge transfer between datasets improves model robustness, making DSSCNet well-suited for automated dysarthria assessment. This research contributes to the development of more effective assistive speech technologies for individuals with speech impairments.

3.2 · AS · Jun 20

How Well Do Self-Supervised Speech Models Encode Age and Gender in Children's Speech? A Layer-Wise Analysis Across Multiple Architectures

Abhijit Sinha, Hemant Kumar Kathania, Mohit Joshi et al.

Self-supervised learning (SSL) models have become a central component of modern speech processing systems, as they enable the learning of rich acoustic representations without reliance on labeled data. Despite their success on adult speech, it remains unclear how effectively these models capture speaker-related attributes such as age and gender in children's speech, which differs substantially from adult speech due to ongoing physiological and cognitive development. Higher pitch, increased articulatory variability, and age-dependent acoustic changes make children's speech a particularly challenging domain. In this work, we present a comprehensive analysis of how age and gender information is encoded across layers of four widely used SSL models: Wav2Vec2, HuBERT, Data2Vec, and WavLM. Layer-wise features are extracted and evaluated using a lightweight CNN on two benchmark children's speech corpora, PFSTAR and CMU Kids. PCA is applied to analyze feature compactness, identify redundancy, and highlight the dimensions that contribute most to classification performance. Experimental results show that age- and gender-related information is unevenly distributed across SSL layers, with early to mid-level layers encoding the strongest paralinguistic cues. HuBERT achieves the best overall performance for age classification, while Wav2Vec2 and HuBERT lead gender classification on PFSTAR and CMU Kids, respectively. Beyond single-split evaluation, we further demonstrate that these findings remain stable under speaker-wise cross-validation, layer aggregation, and cross-database evaluation, indicating robustness to data imbalance and domain mismatch. Finally, we show that reliable age and gender classification is achievable even from short speech segments of 1-3 seconds.

2.1 · AS · Jun 18

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo et al.

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

1.2 · AS · Aug 28, 2025

Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models

Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania et al.

Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children's speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children's speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children's speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and probabilities of false alarm and miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system's effectiveness across different age groups of children. To assess the system's robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over the traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated for an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children's speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.

2.3 · AS · Aug 28, 2025

Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?

Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri et al.

Automatic Speech Recognition (ASR) systems often struggle to accurately process children's speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children's speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models (specifically Wav2Vec2, HuBERT, Data2Vec, and WavLM) in improving the performance of ASR for children's speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children's speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children's speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
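The relative improvement quoted in this abstract follows from the two WER values it reports. The short sketch below reproduces the calculation; the helper name `relative_wer_improvement` is illustrative and not taken from the paper.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Return the relative WER reduction, in percent, with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# WER values (%) as reported in the abstract.
direct_zero_shot_wer = 10.65  # direct zero-shot decoding with Wav2Vec2
layer_22_wer = 5.15           # Kaldi DNN system on Wav2Vec2 layer-22 features

improvement = relative_wer_improvement(direct_zero_shot_wer, layer_22_wer)
print(f"Relative WER improvement: {improvement:.2f}%")
```

Note that this is a relative figure: the absolute reduction is 5.50 WER points.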

3.3 · AS · Aug 29, 2020 · Code

Data augmentation using prosody and false starts to recognize non-native children's speech

Hemant Kathania, Mittul Singh, Tamás Grósz et al.

This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, because the speech is spontaneous, it contains false starts transcribed as partial words, which leads to unseen partial words in the test transcriptions. To cope with these two challenges, we investigate a data augmentation-based approach. Firstly, we apply prosody-based data augmentation to supplement the audio data. Secondly, we simulate false starts by introducing partial-word noise in the language modeling corpora, creating new words. Acoustic models trained on prosody-based augmented data outperform the models using the baseline recipe or the SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, placed third in the evaluation period and achieves a word error rate of 18.71%. Post-evaluation period, we observe that increasing the amount of prosody-based augmented data leads to better performance. Furthermore, removing low-confidence-score words from hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%.
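One way to picture the partial-word noise mentioned in this abstract is to insert truncated copies of words into language-model training text. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name `add_partial_word_noise`, the insertion probability, the prefix-length rule, and the trailing `-` marker are all hypothetical choices.

```python
import random


def add_partial_word_noise(sentence, prob=0.1, min_prefix=1, rng=None):
    """Simulate false starts by inserting a truncated copy before some words.

    prob is the (hypothetical) chance that a word gets a false start;
    the trailing "-" marking a partial word is an assumed convention.
    """
    rng = rng or random.Random()
    noisy = []
    for word in sentence.split():
        # Only words longer than the minimum prefix can be truncated.
        if len(word) > min_prefix and rng.random() < prob:
            cut = rng.randint(min_prefix, len(word) - 1)
            noisy.append(word[:cut] + "-")
        noisy.append(word)
    return " ".join(noisy)


if __name__ == "__main__":
    # A fixed seed keeps the illustration repeatable.
    print(add_partial_word_noise("the cat sat on the mat", prob=0.5,
                                 rng=random.Random(0)))
```

Applying such noise to a text corpus exposes the language model to partial-word tokens it would otherwise never see, which is the motivation the abstract gives for the technique.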

1.2 · AS · Aug 6, 2020

Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

Tamás Grósz, Mittul Singh, Sudarsana Reddy Kadiri et al.

End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model can be unstable, and using the same architecture under-utilizes task-specific information. On ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge to predict the output of a respiratory belt worn by a patient while speaking, the elderly sub-challenge to estimate the elderly speaker's arousal and valence levels, and the mask sub-challenge to classify whether the speaker is wearing a mask. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task performance. On the elderly sub-challenge, predicting the valence and arousal levels prompts us to investigate multi-task training and implement data sampling strategies to handle class imbalance. On the mask sub-challenge, using an E2E system without feature engineering is competitive with feature-engineered baselines and provides substantial gains when combined with them.