Ismail Shahin

SD
33 papers
635 citations
Novelty 35%
AI Score 23

33 Papers

SD Jan 9, 2022
Emotional Speaker Identification using a Novel Capsule Nets Model

Ali Bou Nassif, Ismail Shahin, Ashraf Elnagar et al.

Speaker recognition systems are widely used to identify a person by their voice; however, the high variability of speech signals makes this a challenging task. Emotional variation is especially difficult to handle because emotions alter a person's voice characteristics, so the acoustic features differ from those used to train models on neutral speech. As a result, speaker recognition models trained on neutral speech fail to correctly identify speakers under emotional stress. Although convolutional neural networks (CNNs) have brought considerable advances in speaker identification, CNNs cannot exploit the spatial association between low-level features. Capsule networks (CapsNets) were recently introduced to overcome the inadequacy of CNNs, with their pooling technique, in preserving the pose relationship between low-level features. Inspired by this, the study investigates the performance of CapsNets in identifying speakers from emotional speech recordings. A CapsNet-based speaker identification model is proposed and evaluated on three distinct speech databases: the Emirati Speech Database, the SUSAS dataset, and RAVDESS (open access). The proposed model is also compared to baseline systems. Experimental results demonstrate that the proposed CapsNet model trains faster and provides better results than current state-of-the-art schemes. The effect of the routing algorithm on speaker identification performance was also studied by varying the number of iterations, both with and without a decoder network.
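The abstract mentions varying the number of routing iterations. As a rough illustration only, the sketch below shows the routing-by-agreement procedure commonly associated with capsule networks, written with NumPy. It is not the authors' model: the function names, the tensor shapes, and the default of three iterations are assumptions made for this example.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink each vector's length into [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors to output capsules.

    u_hat has shape (num_in, num_out, dim_out); this layout is an
    assumption made for the sketch.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits, start uniform
    for _ in range(num_iterations):
        # Coupling coefficients: softmax over the output capsules.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of predictions for each output capsule.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Raise the logits where a prediction agrees with the output.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v
```

Changing `num_iterations` is the knob that corresponds to the iteration count studied in the paper; how it affects accuracy is reported there, not here.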

SD Dec 26, 2021
Novel Hybrid DNN Approaches for Speaker Verification in Emotional and Stressful Talking Environments

Ismail Shahin, Ali Bou Nassif, Nawel Nemmour et al.

In this work, we conducted an empirical comparative study of the performance of text-independent speaker verification in emotional and stressful environments. This work combined deep models with shallow architectures, which resulted in novel hybrid classifiers. Four distinct hybrid models were utilized: deep neural network-hidden Markov model (DNN-HMM), deep neural network-Gaussian mixture model (DNN-GMM), Gaussian mixture model-deep neural network (GMM-DNN), and hidden Markov model-deep neural network (HMM-DNN). All models were based on newly implemented architectures. The comparative study used three distinct speech datasets: a private Arabic dataset and two public English databases, namely, Speech Under Simulated and Actual Stress (SUSAS) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The test results of these hybrid models demonstrated that the proposed HMM-DNN improved verification performance in emotional and stressful environments. Results also showed that HMM-DNN outperformed all other hybrid models in terms of the equal error rate (EER) and area under the curve (AUC) evaluation metrics. Averaged over the three datasets, the verification systems yielded EERs of 7.19%, 16.85%, 11.51%, and 11.90% based on HMM-DNN, DNN-HMM, DNN-GMM, and GMM-DNN, respectively. Furthermore, we found that the DNN-GMM model had the lowest computational complexity of all the hybrid models in both talking environments. Conversely, the HMM-DNN model required the greatest amount of training time. Findings also demonstrated that EER and AUC values depended on the database when comparing average emotional and stressful performances.
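The EER quoted above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from two lists of verification scores. The function name, the "accept when score >= threshold" convention, and the sample scores are assumptions for illustration, not the paper's evaluation code.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the threshold where FAR and FRR are closest.

    A trial is accepted when its score is >= the threshold (an assumed
    convention; higher scores mean "more likely the claimed speaker").
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False acceptance: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine trials that fall below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, invented for this example.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With finite score lists the two error curves rarely cross exactly, so the sketch reports the midpoint of FAR and FRR at the closest threshold.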

SD Dec 26, 2021
Novel Dual-Channel Long Short-Term Memory Compressed Capsule Networks for Emotion Recognition

Ismail Shahin, Noor Hindawi, Ali Bou Nassif et al.

Recent work on speech emotion recognition (SER) has made considerable advances using MFCC spectrogram features and neural network approaches such as convolutional neural networks (CNNs). Capsule networks (CapsNets) have gained recognition as alternatives to CNNs because of their larger capacity for hierarchical representation. This research introduces a novel text-independent and speaker-independent SER architecture, in which a dual-channel long short-term memory compressed-CapsNet (DC-LSTM COMP-CapsNet) algorithm is proposed based on the structural features of CapsNet. The proposed classifier provides energy efficiency and an adequate compression method for speech emotion recognition, which the original CapsNet structure does not deliver. Moreover, a grid search is used to attain optimal solutions. Results show improved performance and a reduction in training and testing running time. The speech datasets used to evaluate our algorithm are an Arabic Emirati-accented corpus, the English Speech Under Simulated and Actual Stress corpus, the English Ryerson Audio-Visual Database of Emotional Speech and Song corpus, and the Crowd-Sourced Emotional Multimodal Actors Dataset. This work finds that MFCCs delta-delta is the best feature extraction method among the known methods compared. Using the four datasets and MFCCs delta-delta, DC-LSTM COMP-CapsNet surpasses all the state-of-the-art systems, classical classifiers, CNN, and the original CapsNet. Using the Arabic Emirati-accented corpus, our results demonstrate that the proposed work yields an average emotion recognition accuracy of 89.3% compared to 84.7%, 82.2%, 69.8%, 69.2%, 53.8%, 42.6%, and 31.9% based on CapsNet, CNN, support vector machine, multi-layer perceptron, k-nearest neighbor, radial basis function, and naive Bayes, respectively.

LG Dec 15, 2021
COVID-19 Electrocardiograms Classification using CNN Models

Ismail Shahin, Ali Bou Nassif, Mohamed Bader Alsabek

With the periodic rise and fall of COVID-19 and numerous countries affected by its ramifications, scientists, researchers, and doctors all over the world have done a tremendous amount of work. Prompt intervention is urgently needed to tackle the rapid spread of the disease. Artificial Intelligence (AI) has made a significant contribution to the digital health sector by applying the fundamentals of deep learning algorithms. In this study, a novel approach is proposed to automatically diagnose COVID-19 from Electrocardiogram (ECG) data using deep learning algorithms, specifically Convolutional Neural Network (CNN) models. Several CNN models are used in the proposed framework, including VGG16, VGG19, InceptionResnetv2, InceptionV3, Resnet50, and Densenet201. The VGG16 model outperformed the rest of the models, with an accuracy of 85.92%. The other models show relatively low accuracy compared to VGG16, which is due to the small size of the dataset and to the fact that grid search hyperparameter optimization was applied to the VGG16 model only. Moreover, our results are preliminary, and the accuracy of all models could be improved by further expanding the dataset and adopting a suitable hyperparameter optimization technique.

SD Dec 15, 2021
The exploitation of Multiple Feature Extraction Techniques for Speaker Identification in Emotional States under Disguised Voices

Noor Ahmad Al Hindawi, Ismail Shahin, Ali Bou Nassif

Owing to improvements in artificial intelligence, speaker identification (SI) technologies have advanced greatly and are now widely used in a variety of sectors. One of the most important components of SI is feature extraction, which has a substantial impact on the SI process and performance. Accordingly, numerous feature extraction strategies are thoroughly investigated, contrasted, and analyzed. This article exploits five distinct feature extraction methods for speaker identification in disguised voices under emotional environments. To evaluate this work thoroughly, three voice-disguise effects are used: high-pitched, low-pitched, and Electronic Voice Conversion (EVC). Experimental results show that the concatenation of Mel-Frequency Cepstral Coefficients (MFCCs), MFCCs-delta, and MFCCs-delta-delta is the best feature extraction method.
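To make the concatenated feature concrete, here is a minimal sketch that stacks static MFCCs with their first and second temporal derivatives. It assumes the MFCC matrix has already been computed and has shape (frames, coefficients); the regression window of two frames, the edge padding, and both function names are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based temporal derivative of a (frames, coeffs) matrix."""
    num_frames = features.shape[0]
    # Repeat the first and last frame so every frame has a full window.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom

def stack_mfcc_deltas(mfcc):
    """Concatenate MFCCs, delta, and delta-delta along the coefficient axis."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return np.hstack([mfcc, d1, d2])
```

For a 13-coefficient MFCC matrix this yields 39 values per frame, with the static coefficients in the first 13 columns.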

SD Feb 11, 2021
CASA-Based Speaker Identification Using Cascaded GMM-CNN Classifier in Noisy and Emotional Talking Conditions

Ali Bou Nassif, Ismail Shahin, Shibani Hamsa et al.

This work aims at improving text-independent speaker identification performance in real application situations such as noisy and emotional talking conditions. This is achieved by incorporating two modules: a Computational Auditory Scene Analysis (CASA)-based pre-processing module for noise reduction and a cascaded Gaussian Mixture Model-Convolutional Neural Network (GMM-CNN) classifier for speaker identification followed by emotion recognition. This research proposes and evaluates a novel algorithm to improve the accuracy of speaker identification in emotional and highly noise-susceptible conditions. Experiments demonstrate that the proposed model yields promising results in comparison with other classifiers when the Speech Under Simulated and Actual Stress (SUSAS) database, the Emirati Speech Database (ESD), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Fluent Speech Commands database are used in a noisy environment.

SD Oct 17, 2020
Studying the Similarity of COVID-19 Sounds based on Correlation Analysis of MFCC

Mohamed Bader, Ismail Shahin, Abdelfatah Hassan

Recently, people working on the front lines in hospitals, clinics, and labs, alongside researchers and scientists, have put tremendous effort into the fight against the COVID-19 pandemic. Because of the rapid spread of the virus, artificial intelligence has taken on a considerable role in the health sector through the fundamentals of Automatic Speech Recognition (ASR) and deep learning algorithms. In this paper, we illustrate the importance of speech signal processing by extracting the Mel-Frequency Cepstral Coefficients (MFCCs) of COVID-19 and non-COVID-19 samples and finding their relationship using Pearson correlation coefficients. Our results show high similarity in MFCCs between different COVID-19 cough and breathing sounds, while the MFCCs of voice are more robust between COVID-19 and non-COVID-19 samples. Moreover, our results are preliminary, and there is a possibility of excluding the voices of COVID-19 patients from further processing in diagnosing the disease.
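The Pearson correlation coefficient used in this comparison can be sketched in a few lines. The example below assumes each recording has already been reduced to one MFCC vector (for instance, per-coefficient averages over time); that reduction and the function name are assumptions for illustration and may differ from the paper's procedure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 indicates that two MFCC vectors rise and fall together across coefficients, which is the sense of "high similarity" in the abstract.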

AS Feb 4, 2020
Emotion Recognition Using Speaker Cues

Ismail Shahin

This research aims at identifying an unknown emotion using speaker cues, based on a two-stage framework. The first stage identifies the speaker who uttered the unknown emotion, while the second stage identifies the unknown emotion uttered by the speaker recognized in the first stage. The proposed framework has been evaluated on an Arabic Emirati-accented speech database uttered by fifteen speakers per gender. Mel-Frequency Cepstral Coefficients (MFCCs) have been used as the extracted features and a Hidden Markov Model (HMM) has been used as the classifier in this work. Our findings demonstrate that emotion recognition accuracy based on the two-stage framework is greater than that based on the one-stage approach and on state-of-the-art classifiers and models such as Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and Vector Quantization (VQ). The average emotion recognition accuracy based on the two-stage approach is 67.5%, while the accuracy reaches 61.4%, 63.3%, 64.5%, and 61.5% based on the one-stage approach, GMM, SVM, and VQ, respectively. The results achieved with the two-stage framework are very close to those attained in subjective assessment by human listeners.
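The two-stage idea can be sketched as two successive maximum-score decisions. In the example below, every model is a hypothetical callable that returns a score (such as a log-likelihood) for the input features; the dictionary layout and the function name are assumptions for illustration and say nothing about how the paper trains or stores its HMMs.

```python
def two_stage_emotion_recognition(features, speaker_models, emotion_models):
    """Identify the speaker first, then the emotion for that speaker.

    speaker_models: {speaker: score_fn}
    emotion_models: {speaker: {emotion: score_fn}}
    Each score_fn is a hypothetical callable; a higher score is a better match.
    """
    # Stage 1: pick the speaker whose model scores the utterance highest.
    speaker = max(speaker_models, key=lambda s: speaker_models[s](features))
    # Stage 2: pick the emotion using only that speaker's emotion models.
    candidates = emotion_models[speaker]
    emotion = max(candidates, key=lambda e: candidates[e](features))
    return speaker, emotion
```

One consequence of this design is that a wrong speaker decision in the first stage forces the second stage to use the wrong speaker's emotion models.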

SD Sep 29, 2019
Speaker Verification in Emotional Talking Environments based on Third-Order Circular Suprasegmental Hidden Markov Model

Ismail Shahin, Ali Bou Nassif

Speaker verification accuracy in emotional talking environments is not as high as it is in neutral ones. This work aims at accepting or rejecting the claimed speaker using his/her voice in emotional environments based on the Third-Order Circular Suprasegmental Hidden Markov Model (CSPHMM3) as a classifier. An Emirati-accented (Arabic) speech database with Mel-Frequency Cepstral Coefficients as the extracted features has been used to evaluate our work. Our results demonstrate that speaker verification accuracy based on CSPHMM3 is greater than that based on state-of-the-art classifiers and models such as Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and Vector Quantization (VQ).

SD Sep 28, 2019
Emirati-Accented Speaker Identification in Stressful Talking Conditions

Ismail Shahin, Ali Bou Nassif

This research is dedicated to improving text-independent Emirati-accented speaker identification performance in stressful talking conditions using three distinct classifiers: First-Order Hidden Markov Models (HMM1s), Second-Order Hidden Markov Models (HMM2s), and Third-Order Hidden Markov Models (HMM3s). The database used in this work was collected from 25 native Emirati speakers per gender uttering eight common Emirati sentences in each of neutral, shouted, slow, loud, soft, and fast talking conditions. The features extracted from the captured database are Mel-Frequency Cepstral Coefficients (MFCCs). Based on HMM1s, HMM2s, and HMM3s, average Emirati-accented speaker identification accuracy in stressful conditions is 58.6%, 61.1%, and 65.0%, respectively. The average speaker identification accuracy achieved in stressful conditions based on HMM3s is very similar to that attained in subjective assessment by human listeners.

SD Mar 23, 2019
Emotion Recognition based on Third-Order Circular Suprasegmental Hidden Markov Model

Ismail Shahin

This work focuses on recognizing an unknown emotion based on the Third-Order Circular Suprasegmental Hidden Markov Model (CSPHMM3) as a classifier. Our work has been tested on the Emotional Prosody Speech and Transcripts (EPST) database. The features extracted from the EPST database are Mel-Frequency Cepstral Coefficients (MFCCs). Our results give an average emotion recognition accuracy of 77.8% based on the CSPHMM3. The results of this work demonstrate that CSPHMM3 is superior to the Third-Order Hidden Markov Model (HMM3), Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and Vector Quantization (VQ) by 6.0%, 4.9%, 3.5%, and 5.4%, respectively, for emotion recognition. The average emotion recognition accuracy achieved based on the CSPHMM3 is comparable to that found using subjective assessment by human judges.

SD Oct 11, 2018
Novel Cascaded Gaussian Mixture Model-Deep Neural Network Classifier for Speaker Identification in Emotional Talking Environments

Ismail Shahin, Ali Bou Nassif, Shibani Hamsa

This research presents an effective approach to enhance text-independent speaker identification performance in emotional talking environments based on a novel classifier called the cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). The work focuses on proposing, implementing, and evaluating this classifier for speaker identification in emotional talking environments. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across various emotions using two distinct speech databases: the Emirati speech database (Arabic United Arab Emirates dataset) and the Speech Under Simulated and Actual Stress (SUSAS) English dataset. The proposed classifier outperforms classical classifiers such as Multilayer Perceptron (MLP) and Support Vector Machine (SVM) on each dataset. The speaker identification performance attained with the cascaded GMM-DNN is similar to that obtained from subjective assessment by human listeners.

SD Sep 3, 2018
Three-Stage Speaker Verification Architecture in Emotional Talking Environments

Ismail Shahin, Ali Bou Nassif

Speaker verification performance in neutral talking environments is usually high, while it decreases sharply in emotional talking environments. This degradation is due to the mismatch between training in a neutral environment and testing in emotional environments. In this work, a three-stage speaker verification architecture is proposed to enhance speaker verification performance in emotional environments. The architecture comprises three cascaded stages: a gender identification stage, followed by an emotion identification stage, followed by a speaker verification stage. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the Emotional Prosody Speech and Transcripts dataset. Our results show that speaker verification based on both gender information and emotion information is superior to speaker verification based on gender information only, emotion information only, and neither gender information nor emotion information. The average speaker verification performance attained with the proposed framework is very similar to that attained in subjective assessment by human listeners.
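The cascade described above can be sketched as three decisions in sequence. All models below are hypothetical scoring callables, and the final accept/reject rule (score compared with a fixed threshold) is an assumption made for this example; the abstract does not specify the scoring or the decision rule.

```python
def three_stage_verify(features, claimed_speaker, gender_models,
                       emotion_models, speaker_models, threshold):
    """Gender identification, then emotion identification, then verification.

    gender_models:  {gender: score_fn}
    emotion_models: {gender: {emotion: score_fn}}
    speaker_models: {(gender, emotion): {speaker: score_fn}}
    Each score_fn is a hypothetical callable; a higher score is a better match.
    """
    # Stage 1: identify the gender of the unknown utterance.
    gender = max(gender_models, key=lambda g: gender_models[g](features))
    # Stage 2: identify the emotion using gender-specific emotion models.
    emotions = emotion_models[gender]
    emotion = max(emotions, key=lambda e: emotions[e](features))
    # Stage 3: score the claimed speaker with the matching model set.
    score = speaker_models[(gender, emotion)][claimed_speaker](features)
    return score >= threshold, gender, emotion, score
```

Each stage narrows the set of models used by the next one, which is how the gender and emotion cues reach the final verification decision.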

SD Mar 31, 2018
Speaker Verification in Emotional Talking Environments based on Three-Stage Framework

Ismail Shahin

This work is dedicated to introducing, executing, and assessing a three-stage speaker verification framework to enhance the degraded speaker verification performance in emotional talking environments. Our framework is comprised of three cascaded stages: gender identification stage followed by an emotion identification stage followed by a speaker verification stage. The proposed framework has been assessed on two distinct and independent emotional speech datasets: our collected dataset and Emotional Prosody Speech and Transcripts dataset. Our results demonstrate that speaker verification based on both gender cues and emotion cues is superior to each of speaker verification based on gender cues only, emotion cues only, and neither gender cues nor emotion cues. The achieved average speaker verification performance based on the suggested methodology is very similar to that attained in subjective assessment by human listeners.

SD Mar 31, 2018
Emirati-Accented Speaker Identification in each of Neutral and Shouted Talking Environments

Ismail Shahin, Ali Bou Nassif, Mohammed Bahutair

This work is devoted to capturing an Emirati-accented speech database (Arabic United Arab Emirates database) in each of neutral and shouted talking environments in order to study and enhance text-independent Emirati-accented speaker identification performance in a shouted environment based on each of First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s), Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s), and Third-Order Circular Suprasegmental Hidden Markov Models (CSPHMM3s) as classifiers. In this research, our database was collected from fifty native Emirati speakers (twenty-five per gender) uttering eight common Emirati sentences in each of neutral and shouted talking environments. The features extracted from our collected database are Mel-Frequency Cepstral Coefficients (MFCCs). Our results show that average Emirati-accented speaker identification performance in a neutral environment is 94.0%, 95.2%, and 95.9% based on CSPHMM1s, CSPHMM2s, and CSPHMM3s, respectively. In contrast, the average performance in a shouted environment is 51.3%, 55.5%, and 59.3% based on CSPHMM1s, CSPHMM2s, and CSPHMM3s, respectively. The average speaker identification performance achieved in a shouted environment based on CSPHMM3s is very similar to that obtained in subjective assessment by human listeners.

SD Jan 22, 2018
Identifying Speakers Using Their Emotion Cues

Ismail Shahin

This paper addresses the formulation of a new speaker identification approach which employs knowledge of emotional content of speaker information. Our proposed approach in this work is based on a two-stage recognizer that combines and integrates both emotion recognizer and speaker recognizer into one recognizer. The proposed approach employs both Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. In the experiments, six emotions are considered including neutral, angry, sad, happy, disgust and fear. Our results show that average speaker identification performance based on the proposed two-stage recognizer is 79.92% with a significant improvement over a one-stage recognizer with an identification performance of 71.58%. The results obtained based on the proposed approach are close to those achieved in subjective evaluation by human listeners.

SD Jan 20, 2018
Gender-dependent emotion recognition based on HMMs and SPHMMs

Ismail Shahin

It is well known that emotion recognition performance is not ideal. This research is devoted to improving emotion recognition performance by employing a two-stage recognizer that combines and integrates a gender recognizer and an emotion recognizer into one system. Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) have been used as classifiers in the two-stage recognizer. This recognizer has been tested on two distinct and separate emotional speech databases. The first database is our collected database and the second one is the Emotional Prosody Speech and Transcripts database. Six basic emotions including the neutral state have been used in each database. Our results show that emotion recognition performance based on the two-stage approach (gender-dependent emotion recognizer) is significantly improved compared to that based on an emotion recognizer without gender information and an emotion recognizer with correct gender information, by an average of 11% and 5%, respectively. This work shows that the highest emotion identification performance occurs when the classifiers are completely biased towards suprasegmental models, with no impact from acoustic models. The results achieved with the two-stage framework fall within 2.28% of those obtained in subjective assessment by human judges.
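The phrase "completely biased towards suprasegmental models" can be read as the extreme setting of a weight that blends acoustic and suprasegmental scores. The sketch below assumes a convex combination of two log-likelihoods controlled by a weight `alpha`; the abstract does not state the combination rule, so this form, the parameter name, and the sample scores are assumptions for illustration.

```python
def combined_log_likelihood(acoustic_ll, suprasegmental_ll, alpha):
    """Blend two log-likelihoods; alpha is the weight on the suprasegmental one.

    alpha = 0 uses only the acoustic (HMM) score,
    alpha = 1 uses only the suprasegmental (SPHMM) score.
    This convex form is an assumed combination rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * acoustic_ll + alpha * suprasegmental_ll

def classify(scores, alpha):
    """scores: {label: (acoustic_ll, suprasegmental_ll)}; returns the best label."""
    return max(scores, key=lambda k: combined_log_likelihood(*scores[k], alpha))


# Toy log-likelihoods, invented for this example.
toy = {"angry": (-10.0, -30.0), "sad": (-20.0, -15.0)}
print(classify(toy, alpha=0.0), classify(toy, alpha=1.0))
```

Under this reading, the abstract's finding corresponds to the best results being obtained at `alpha = 1`.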

SD Jul 2, 2017
Speaker Identification in a Shouted Talking Environment Based on Novel Third-Order Circular Suprasegmental Hidden Markov Models

Ismail Shahin

It is well known that speaker identification yields very high performance in a neutral talking environment; in a shouted talking environment, however, the performance declines sharply. This work aims at proposing, implementing, and evaluating novel Third-Order Circular Suprasegmental Hidden Markov Models (CSPHMM3s) to improve the low performance of text-independent speaker identification in a shouted talking environment. CSPHMM3s combine the characteristics of Circular Hidden Markov Models (CHMMs), Third-Order Hidden Markov Models (HMM3s), and Suprasegmental Hidden Markov Models (SPHMMs). Our results show that CSPHMM3s are superior to each of First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s), Third-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM3s), First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s), and Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) in a shouted talking environment. Using our collected speech database, average speaker identification performance in a shouted talking environment based on LTRSPHMM1s, LTRSPHMM2s, LTRSPHMM3s, CSPHMM1s, CSPHMM2s, and CSPHMM3s is 74.6%, 78.4%, 81.7%, 78.7%, 83.4%, and 85.8%, respectively. The speaker identification performance achieved with CSPHMM3s is close to that attained in subjective assessment by human listeners.

SD Jul 2, 2017
Emirati Speaker Verification Based on HMM1s, HMM2s, and HMM3s

Ismail Shahin

This work focuses on Emirati speaker verification systems in neutral talking environments based on each of First-Order Hidden Markov Models (HMM1s), Second-Order Hidden Markov Models (HMM2s), and Third-Order Hidden Markov Models (HMM3s) as classifiers. These systems have been evaluated on our collected Emirati speech database, which comprises 25 male and 25 female Emirati speakers, using Mel-Frequency Cepstral Coefficients (MFCCs) as the extracted features. Our results show that HMM3s outperform each of HMM1s and HMM2s for text-independent Emirati speaker verification. The results obtained with HMM3s are close to those achieved in subjective assessment by human listeners.

SD Jul 1, 2017
Modeling and Analyzing the Vocal Tract under Normal and Stressful Talking Conditions

Ismail Shahin, Nazeih Botros

In this research, we model and analyze the vocal tract under normal and stressful talking conditions. The analysis explains why the recognition performance of text-dependent speaker identification degrades under stressful talking conditions. The findings can be used in future research to improve recognition performance under stressful talking conditions.

SD Jul 1, 2017
Speaker Identification in Shouted Talking Environments Based on Novel Third-Order Hidden Markov Models

Ismail Shahin

In this work, we propose, implement, and evaluate novel models called Third-Order Hidden Markov Models (HMM3s) to enhance the low performance of text-independent speaker identification in shouted talking environments. The proposed models have been tested on our collected speech database using Mel-Frequency Cepstral Coefficients (MFCCs). Our results demonstrate that HMM3s significantly improve speaker identification performance in such talking environments by 11.3% and 166.7% compared to second-order hidden Markov models (HMM2s) and first-order hidden Markov models (HMM1s), respectively. The results achieved with the proposed models are close to those obtained in subjective assessment by human listeners.

SD Jul 1, 2017
Studying and Enhancing Talking Condition Recognition in Stressful and Emotional Talking Environments Based on HMMs, CHMM2s and SPHMMs

Ismail Shahin

This research is devoted to studying and enhancing talking condition recognition in stressful and emotional talking environments (two completely separate environments) based on three different and separate classifiers: Hidden Markov Models (HMMs), Second-Order Circular Hidden Markov Models (CHMM2s), and Suprasegmental Hidden Markov Models (SPHMMs). The stressful talking environments used in this work are composed of neutral, shouted, slow, loud, soft, and fast talking conditions, while the emotional talking environments are made up of neutral, angry, sad, happy, disgust, and fear emotions. The results show that SPHMMs outperform each of HMMs and CHMM2s in improving talking condition recognition in stressful and emotional talking environments. The results also demonstrate that talking condition recognition in stressful talking environments outperforms that in emotional talking environments by 2.7%, 1.8%, and 3.3% based on HMMs, CHMM2s, and SPHMMs, respectively. Based on subjective assessment by human judges, the recognition performance of stressful talking conditions leads that of emotional ones by 5.2%.

SD Jul 1, 2017
Employing Emotion Cues to Verify Speakers in Emotional Talking Environments

Ismail Shahin

Usually, people talk neutrally in environments where there are no abnormal talking conditions such as stress and emotion. Emotional conditions such as happiness, anger, and sadness might affect a person's talking tone, and such emotions are directly affected by a patient's health status. In neutral talking environments, speakers can be easily verified; in emotional talking environments, however, they cannot be verified as easily. Consequently, speaker verification systems do not perform as well in emotional talking environments as they do in neutral ones. In this work, a two-stage approach has been employed and evaluated to improve speaker verification performance in emotional talking environments. This approach employs speaker emotion cues (a text-independent and emotion-dependent speaker verification problem) based on both Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. The approach comprises two cascaded stages that combine and integrate an emotion recognizer and a speaker recognizer into one recognizer. The architecture has been tested on two different and separate emotional speech databases: our collected database and the Emotional Prosody Speech and Transcripts database. The results of this work show that the proposed approach gives promising results, with a significant improvement over previous studies and other approaches such as an emotion-independent speaker verification approach and an emotion-dependent speaker verification approach based completely on HMMs.

SD Jul 1, 2017
Talking Condition Identification Using Second-Order Hidden Markov Models

Ismail Shahin

This work focuses on enhancing the performance of text-dependent and speaker-dependent talking condition identification systems using second-order hidden Markov models (HMM2s). Our results show that the talking condition identification performance based on HMM2s has been improved significantly compared to first-order hidden Markov models (HMM1s). Our talking conditions in this work are neutral, shouted, loud, angry, happy, and fear.

AI Jun 29, 2017
Speaker Identification in each of the Neutral and Shouted Talking Environments based on Gender-Dependent Approach Using SPHMMs

Ismail Shahin

It is well known that speaker identification performs extremely well in neutral talking environments; however, identification performance declines sharply in shouted talking environments. This work aims at proposing, implementing, and testing a new approach to enhance the degraded performance in shouted talking environments. The proposed approach is based on gender-dependent speaker identification using Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. The approach has been tested on two different and separate speech databases: our collected database and the Speech Under Simulated and Actual Stress (SUSAS) database. The results of this work show that gender-dependent speaker identification based on SPHMMs outperforms gender-independent speaker identification based on the same models and gender-dependent speaker identification based on Hidden Markov Models (HMMs) by about 6% and 8%, respectively. The results obtained with the proposed approach are close to those obtained in subjective evaluation by human judges.

SD Jun 29, 2017
Employing both Gender and Emotion Cues to Enhance Speaker Identification Performance in Emotional Talking Environments

Ismail Shahin

Speaker recognition performance in emotional talking environments is not as high as it is in neutral talking environments. This work focuses on proposing, implementing, and evaluating a new approach to enhance the performance in emotional talking environments. The proposed approach is based on identifying the unknown speaker using both his/her gender and emotion cues. Both Hidden Markov Models (HMMs) and Suprasegmental Hidden Markov Models (SPHMMs) have been used as classifiers in this work. This approach has been tested on our collected emotional speech database, which is composed of six emotions. The results of this work show that speaker identification performance based on both gender and emotion cues is higher than that based on gender cues only, emotion cues only, and neither gender nor emotion cues by 7.22%, 4.45%, and 19.56%, respectively. This work also shows that the optimum speaker identification performance in emotional talking environments occurs when the classifiers are completely biased towards suprasegmental models, with no impact from acoustic models. The average speaker identification performance achieved with the proposed approach falls within 2.35% of that obtained in subjective evaluation by human judges.

SDJun 29, 2017
Using Second-Order Hidden Markov Model to Improve Speaker Identification Recognition Performance under Neutral Condition

Ismail Shahin

In this paper, a second-order hidden Markov model (HMM2) is implemented and used to improve the recognition performance of text-dependent speaker identification systems under the neutral talking condition. Our results show that HMM2 improves recognition performance under the neutral talking condition by 9% compared to the first-order hidden Markov model (HMM1).
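
In a second-order HMM the next state depends on the two preceding states rather than one. As a rough illustration of what that changes, here is a sketch of the forward (likelihood) computation for a discrete second-order HMM. The parameter layout (`pi`, `a1`, `a2`, `b`), the use of a first-order transition for the first step, and the toy numbers are assumptions made for this example and are not taken from the paper.

```python
def forward_hmm2(pi, a1, a2, b, obs):
    """Likelihood P(obs) under a discrete second-order HMM.

    pi[i]       : initial state probability
    a1[i][j]    : first-order transition, used only for the first step
    a2[i][j][k] : P(state k | previous two states were i then j)
    b[i][o]     : probability of emitting symbol o in state i
    """
    n = len(pi)
    if not obs:
        raise ValueError("obs must contain at least one symbol")
    if len(obs) == 1:
        return sum(pi[i] * b[i][obs[0]] for i in range(n))

    # alpha[i][j] = P(o_1..o_t, q_{t-1} = i, q_t = j), initialised at t = 2
    alpha = [[pi[i] * b[i][obs[0]] * a1[i][j] * b[j][obs[1]]
              for j in range(n)] for i in range(n)]

    # Recursion: sum out the oldest state of each (i, j, k) triple.
    for o in obs[2:]:
        alpha = [[sum(alpha[i][j] * a2[i][j][k] for i in range(n)) * b[k][o]
                  for k in range(n)] for j in range(n)]

    return sum(sum(row) for row in alpha)


# Toy model: 2 states, 2 symbols (all values are made up).
pi = [0.6, 0.4]
a1 = [[0.7, 0.3], [0.4, 0.6]]
a2 = [[[0.8, 0.2], [0.3, 0.7]],
      [[0.5, 0.5], [0.1, 0.9]]]
b = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_hmm2(pi, a1, a2, b, [0, 1, 1])
```

The forward variable is indexed by a pair of states, so the cost per time step grows from N^2 to N^3 for N states compared with a first-order model.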

SDJun 29, 2017
Speaker Identification Investigation and Analysis in Unbiased and Biased Emotional Talking Environments

Ismail Shahin

This work investigates and analyzes speaker identification in both unbiased and biased emotional talking environments based on a classifier called Suprasegmental Hidden Markov Models (SPHMMs). The first talking environment is unbiased towards any emotion, while the second talking environment is biased towards different emotions. Each of these talking environments is made up of six distinct emotions: neutral, angry, sad, happy, disgust, and fear. The investigation and analysis show that speaker identification performance in the biased talking environment is superior to that in the unbiased talking environment. The results obtained in this work are close to those achieved in subjective assessment by human judges.

SDJun 29, 2017
Speaking Style Authentication Using Suprasegmental Hidden Markov Models

Ismail Shahin

Speaking style authentication from human speech is gaining increasing attention from the engineering community. Its importance comes from the demand to enhance both the naturalness and efficiency of spoken-language human-machine interfaces. This research focuses on proposing, implementing, and testing speaker-dependent and text-dependent speaking style authentication (verification) systems that accept or reject the identity claim of a speaking style based on suprasegmental hidden Markov models (SPHMMs). Using SPHMMs, our results show that the average speaking style authentication performance is 99%, 37%, 85%, 60%, 61%, 59%, 41%, 61%, and 57% for the neutral, shouted, slow, loud, soft, fast, angry, happy, and fearful speaking styles, respectively.

SDJun 29, 2017
Talking Condition Recognition in Stressful and Emotional Talking Environments Based on CSPHMM2s

Ismail Shahin, Mohammed Nasser Ba-Hutair

This work exploits Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) as classifiers to enhance talking condition recognition in stressful and emotional talking environments (two completely separate environments). The stressful talking environment used in this work is based on the Speech Under Simulated and Actual Stress (SUSAS) database, while the emotional talking environment is based on the Emotional Prosody Speech and Transcripts (EPST) database. The results achieved using Mel-Frequency Cepstral Coefficients (MFCCs) demonstrate that CSPHMM2s outperform Hidden Markov Models (HMMs), Second-Order Circular Hidden Markov Models (CHMM2s), and Suprasegmental Hidden Markov Models (SPHMMs) in enhancing talking condition recognition in both the stressful and emotional talking environments. The results also show that, based on CSPHMM2s, the performance of talking condition recognition in stressful talking environments leads that in emotional talking environments by 3.67%. Our results obtained in subjective evaluation by human judges fall within 2.14% and 3.08% of those obtained, respectively, in stressful and emotional talking environments based on CSPHMM2s.

SDJun 29, 2017
Employing Second-Order Circular Suprasegmental Hidden Markov Models to Enhance Speaker Identification Performance in Shouted Talking Environments

Ismail Shahin

Speaker identification performance is almost perfect in neutral talking environments; however, performance deteriorates significantly in shouted talking environments. This work is devoted to proposing, implementing, and evaluating new models called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) to alleviate the deteriorated performance in shouted talking environments. These proposed models possess the characteristics of both Circular Suprasegmental Hidden Markov Models (CSPHMMs) and Second-Order Suprasegmental Hidden Markov Models (SPHMM2s). The results show that CSPHMM2s outperform First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s), and First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s) in shouted talking environments. In such talking environments and using our collected speech database, average speaker identification performance based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s, and CSPHMM2s is 74.6%, 78.4%, 78.7%, and 83.4%, respectively. Speaker identification performance obtained based on CSPHMM2s is close to that obtained based on subjective assessment by human listeners.

SDJun 29, 2017
Enhancing speaker identification performance under the shouted talking condition using second-order circular hidden Markov models

Ismail Shahin

It is known that the performance of speaker identification systems is high under the neutral talking condition; however, the performance deteriorates under the shouted talking condition. In this paper, second-order circular hidden Markov models (CHMM2s) have been proposed and implemented to enhance the performance of isolated-word text-dependent speaker identification systems under the shouted talking condition. Our results show that CHMM2s significantly improve speaker identification performance under such a condition compared to the first-order left-to-right hidden Markov models (LTRHMM1s), second-order left-to-right hidden Markov models (LTRHMM2s), and the first-order circular hidden Markov models (CHMM1s). Under the shouted talking condition, our results show that the average speaker identification performance is 23% based on LTRHMM1s, 59% based on LTRHMM2s, and 60% based on CHMM1s. On the other hand, the average speaker identification performance under the same talking condition based on CHMM2s is 72%.
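
To put the quoted figures side by side, the short sketch below computes the gain of CHMM2s over each baseline in percentage points. It also computes a relative error reduction, which is a metric added here for illustration and is not reported in the abstract; only the four rates come from the text.

```python
# Average identification rates (%) under the shouted condition, as quoted above.
rates = {"LTRHMM1s": 23, "LTRHMM2s": 59, "CHMM1s": 60, "CHMM2s": 72}


def gains_over(rates, proposed):
    """Return {baseline: (point_gain, relative_error_reduction)}.

    point_gain is in percentage points; relative_error_reduction is the
    fraction of the baseline's error rate removed by the proposed model.
    """
    p = rates[proposed]
    result = {}
    for name, r in rates.items():
        if name == proposed:
            continue
        result[name] = (p - r, (p - r) / (100 - r))
    return result


gains = gains_over(rates, "CHMM2s")
for name, (points, rel) in gains.items():
    print(f"CHMM2s vs {name}: +{points} points, {rel:.0%} of errors removed")
```

The gap is largest against LTRHMM1s (49 points) and much smaller against the two stronger baselines (13 and 12 points).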

SDJun 29, 2017
Speaker Identification in the Shouted Environment Using Suprasegmental Hidden Markov Models

Ismail Shahin

In this paper, Suprasegmental Hidden Markov Models (SPHMMs) have been used to enhance the recognition performance of text-dependent speaker identification in the shouted environment. Two speech databases are used: our collected database and the Speech Under Simulated and Actual Stress (SUSAS) database. Our results show that SPHMMs significantly enhance speaker identification performance compared to Second-Order Circular Hidden Markov Models (CHMM2s) in the shouted environment. Using our collected database, speaker identification performance in this environment is 68% and 75% based on CHMM2s and SPHMMs, respectively. Using the SUSAS database, speaker identification performance in the same environment is 71% and 79% based on CHMM2s and SPHMMs, respectively.