CLNov 30, 2022
An Overview of Indian Spoken Language Recognition from Machine Learning PerspectiveSpandan Dey, Md Sahidullah, Goutam Saha
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there are not many attempts to analytically review them collectively. In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field. In-depth analysis has been presented to emphasize the unique challenges of low-resource and mutual influences for developing LID systems in the Indian contexts. Several essential aspects of the Indian LID research, such as the detailed description of the available speech corpora, the major research contributions, including the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures, and the future research trends are discussed. This review work will help assess the state of the present Indian LID research by any active researcher or any research enthusiasts from related fields.
ASFeb 10, 2023
Cross-Corpora Spoken Language Identification with Domain Diversification and GeneralizationSpandan Dey, Md Sahidullah, Goutam Saha
This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We have conducted the experiments in the context of Indian LID and identified strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contribution to this work is twofold. First, we propose domain diversification, which diversifies the limited training data using different audio data augmentation methods. We then propose the concept of maximally diversity-aware cascaded augmentations and optimize the augmentation fold-factor for effective diversification of the training data. Second, we introduce the idea of domain generalization considering the augmentation methods as pseudo-domains. Towards this, we investigate both domain-invariant and domain-aware approaches. Our LID system is based on the state-of-the-art emphasized channel attention, propagation, and aggregation based time delay neural network (ECAPA-TDNN) architecture. We have conducted extensive experiments with three widely used corpora for Indian LID research. In addition, we conduct a final blind evaluation of our proposed methods on the Indian subset of VoxLingua107 corpus collected in the wild. Our experiments demonstrate that the proposed domain diversification is more promising over commonly used simple augmentation methods. The study also reveals that domain generalization is a more effective solution than domain diversification. We also notice that domain-aware learning performs better for same-corpora LID, whereas domain-invariant learning is more suitable for cross-corpora generalization. Compared to basic ECAPA-TDNN, its proposed domain-invariant extensions improve the cross-corpora EER up to 5.23%. In contrast, the proposed domain-aware extensions also improve performance for same-corpora test scenarios.
ASOct 1, 2023
Wavelet Scattering Transform for Improving Generalization in Low-Resourced Spoken Language IdentificationSpandan Dey, Premjeet Singh, Goutam Saha
Commonly used features in spoken language identification (LID), such as mel-spectrogram or MFCC, lose high-frequency information due to windowing. The loss further increases for longer temporal contexts. To improve generalization of the low-resourced LID systems, we investigate an alternate feature representation, wavelet scattering transform (WST), that compensates for the shortcomings. To our knowledge, WST is not explored earlier in LID tasks. We first optimize WST features for multiple South Asian LID corpora. We show that LID requires low octave resolution and frequency-scattering is not useful. Further, cross-corpora evaluations show that the optimal WST hyper-parameters depend on both train and test corpora. Hence, we develop fused ECAPA-TDNN based LID systems with different sets of WST hyper-parameters to improve generalization for unknown data. Compared to MFCC, EER is reduced upto 14.05% and 6.40% for same-corpora and blind VoxLingua107 evaluations, respectively.
ASMay 11, 2021
Deep scattering network for speech emotion recognitionPremjeet Singh, Goutam Saha, Md Sahidullah
This paper introduces scattering transform for speech emotion recognition (SER). Scattering transform generates feature representations which remain stable to deformations and shifting in time and frequency without much loss of information. In speech, the emotion cues are spread across time and localised in frequency. The time and frequency invariance characteristic of scattering coefficients provides a representation robust against emotion irrelevant variations e.g., different speakers, language, gender etc. while preserving the variations caused by emotion cues. Hence, such a representation captures the emotion information more efficiently from speech. We perform experiments to compare scattering coefficients with standard mel-frequency cepstral coefficients (MFCCs) over different databases. It is observed that frequency scattering performs better than time-domain scattering and MFCCs. We also investigate layer-wise scattering coefficients to analyse the importance of time shift and deformation stable scalogram and modulation spectrum coefficients for SER. We observe that layer-wise coefficients taken independently also perform better than MFCCs.
ASMay 10, 2021
Cross-Corpora Language Recognition: A Preliminary Investigation with Indian LanguagesSpandan Dey, Goutam Saha, Md Sahidullah
In this paper, we conduct one of the very first studies for cross-corpora performance evaluation in the spoken language identification (LID) problem. Cross-corpora evaluation was not explored much in LID research, especially for the Indian languages. We have selected three Indian spoken language corpora: IIITH-ILSC, LDC South Asian, and IITKGP-MLILSC. For each of the corpus, LID systems are trained on the state-of-the-art time-delay neural network (TDNN) based architecture with MFCC features. We observe that the LID performance degrades drastically for cross-corpora evaluation. For example, the system trained on the IIITH-ILSC corpus shows an average EER of 11.80 % and 43.34 % when evaluated with the same corpora and LDC South Asian corpora, respectively. Our preliminary analysis shows the significant differences among these corpora in terms of mismatch in the long-term average spectrum (LTAS) and signal-to-noise ratio (SNR). Subsequently, we apply different feature level compensation methods to reduce the cross-corpora acoustic mismatch. Our results indicate that these feature normalization schemes can help to achieve promising LID performance on cross-corpora experiments.
ASFeb 10, 2021
ABSP System for The Third DIHARD ChallengeA Kishore Kumar, Shefali Waldekar, Goutam Saha et al.
This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is to develop acoustic domain identification (ADI) system for speaker diarization. We investigate speaker embeddings based ADI system. We apply a domain-dependent threshold for agglomerative hierarchical clustering. Besides, we optimize the parameters for PCA-based dimensionality reduction in a domain-dependent way. Our method of integrating domain-based processing schemes in the baseline system of the challenge achieved a relative improvement of $9.63\%$ and $10.64\%$ in DER for core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.
ASFeb 8, 2021
Non-linear frequency warping using constant-Q transformation for speech emotion recognitionPremjeet Singh, Goutam Saha, Md Sahidullah
In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT makes it more promising for SER than standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with deep neural network (DNN) as a back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features for SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with out-of-domain training data.
SDJan 25, 2021
Domain-Dependent Speaker Diarization for the Third DIHARD ChallengeA Kishore Kumar, Shefali Waldekar, Goutam Saha et al.
This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our main contribution in this work is to develop a simple and efficient solution for acoustic domain dependent speech diarization. We explore speaker embeddings for \emph{acoustic domain identification} (ADI) task. Our study reveals that i-vector based method achieves considerably better performance than x-vector based approach in the third DIHARD challenge dataset. Next, we integrate the ADI module with the diarization framework. The performance substantially improved over that of the baseline when we optimized the thresholds for agglomerative hierarchical clustering and the parameters for dimensionality reduction during scoring for individual acoustic domains. We achieved a relative improvement of $9.63\%$ and $10.64\%$ in DER for core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.
ASJul 21, 2020
Optimization of data-driven filterbank for automatic speaker verificationSusanta Sarangi, Md Sahidullah, Goutam Saha
Most of the speech processing applications use triangular filters spaced in mel-scale for feature extraction. In this paper, we propose a new data-driven filter design method which optimizes filter parameters from a given speech data. First, we introduce a frame-selection based approach for developing speech-signal-based frequency warping scale. Then, we propose a new method for computing the filter frequency responses by using principal component analysis (PCA). The main advantage of the proposed method over the recently introduced deep learning based methods is that it requires very limited amount of unlabeled speech-data. We demonstrate that the proposed filterbank has more speaker discriminative power than commonly used mel filterbank as well as existing data-driven filterbank. We conduct automatic speaker verification (ASV) experiments with different corpora using various classifier back-ends. We show that the acoustic features created with proposed filterbank are better than existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In the experiments with VoxCeleb1 and popular i-vector back-end, we observe 9.75% relative improvement in equal error rate (EER) over MFCCs. Similarly, the relative improvement is 4.43% with recently introduced x-vector system. We obtain further improvement using fusion of the proposed method with standard MFCC-based approach.
CVJan 29, 2019
Quality Measures for Speaker Verification with Short UtterancesArnab Poddar, Md Sahidullah, Goutam Saha
The performances of the automatic speaker verification (ASV) systems degrade due to the reduction in the amount of speech used for enrollment and verification. Combining multiple systems based on different features and classifiers considerably reduces speaker verification error rate with short utterances. This work attempts to incorporate supplementary information during the system combination process. We use quality of the estimated model parameters as supplementary information. We introduce a class of novel quality measures formulated using the zero-order sufficient statistics used during the i-vector extraction process. We have used the proposed quality measures as side information for combining ASV systems based on Gaussian mixture model-universal background model (GMM-UBM) and i-vector. The proposed methods demonstrate considerable improvement in speaker recognition performance on NIST SRE corpora, especially in short duration conditions. We have also observed improvement over existing systems based on different duration-based quality measures.
MMJan 23, 2019
Generalization of Spoofing Countermeasures: a Case Study with ASVspoof 2015 and BTAS 2016 CorporaDipjyoti Paul, Md Sahidullah, Goutam Saha
Voice-based biometric systems are highly prone to spoofing attacks. Recently, various countermeasures have been developed for detecting different kinds of attacks such as replay, speech synthesis (SS) and voice conversion (VC). Most of the existing studies are conducted with a specific training set defined by the evaluation protocol. However, for realistic scenarios, selecting appropriate training data is an open challenge for the system administrator. Motivated by this practical concern, this work investigates the generalization capability of spoofing countermeasures in restricted training conditions where speech from a broad attack types are left out in the training database. We demonstrate that different spoofing types have considerably different generalization capabilities. For this study, we analyze the performance using two kinds of features, mel-frequency cepstral coefficients (MFCCs) which are considered as baseline and recently proposed constant Q cepstral coefficients (CQCCs). The experiments are conducted with standard Gaussian mixture model - maximum likelihood (GMM-ML) classifier on two recently released spoofing corpora: ASVspoof 2015 and BTAS 2016 that includes cross-corpora performance analysis. Feature-level analysis suggests that static and dynamic coefficients of spectral features, both are important for detecting spoofing attacks in the real-life condition.
MMDec 3, 2018
Novel Quality Metric for Duration Variability Compensation in Speaker Verification using i-VectorsArnab Poddar, Md Sahidullah, Goutam Saha
Automatic speaker verification (ASV) is the process to recognize persons using voice as biometric. The ASV systems show considerable recognition performance with sufficient amount of speech from matched condition. One of the crucial challenges of ASV technology is to improve recognition performance with speech segments of short duration. In short duration condition, the model parameters are not properly estimated due to inadequate speech information, and this results poor recognition accuracy even with the state-of-the-art i-vector based ASV system. We hypothesize that considering the estimation quality during recognition process would help to improve the ASV performance. This can be incorporated as a quality measure during fusion of ASV systems. This paper investigates a new quality measure for i-vector representation of speech utterances computed directly from Baum-Welch statistics. The proposed metric is subsequently used as quality measure during fusion of ASV systems. In experiments with the NIST SRE 2008 corpus, We have shown that inclusion of proposed quality metric exhibits considerable improvement in speaker verification performance. The results also indicate the potentiality of the proposed method in real-world scenario with short test utterances.
SDDec 22, 2016
Robustness of Voice Conversion Techniques Under Mismatched ConditionsMonisankha Pal, Dipjyoti Paul, Md Sahidullah et al.
Most of the existing studies on voice conversion (VC) are conducted in acoustically matched conditions between source and target signal. However, the robustness of VC methods in presence of mismatch remains unknown. In this paper, we report a comparative analysis of different VC techniques under mismatched conditions. The extensive experiments with five different VC techniques on CMU ARCTIC corpus suggest that performance of VC methods substantially degrades in noisy conditions. We have found that bilinear frequency warping with amplitude scaling (BLFWAS) outperforms other methods in most of the noisy conditions. We further explore the suitability of different speech enhancement techniques for robust conversion. The objective evaluation results indicate that spectral subtraction and log minimum mean square error (logMMSE) based speech enhancement techniques can be used to improve the performance in specific noisy conditions.
SDMar 14, 2016
Novel Speech Features for Improved Detection of Spoofing AttacksDipjyoti Paul, Monisankha Pal, Goutam Saha
Now-a-days, speech-based biometric systems such as automatic speaker verification (ASV) are highly prone to spoofing attacks by an imposture. With recent development in various voice conversion (VC) and speech synthesis (SS) algorithms, these spoofing attacks can pose a serious potential threat to the current state-of-the-art ASV systems. To impede such attacks and enhance the security of the ASV systems, the development of efficient anti-spoofing algorithms is essential that can differentiate synthetic or converted speech from natural or human speech. In this paper, we propose a set of novel speech features for detecting spoofing attacks. The proposed features are computed using alternative frequency-warping technique and formant-specific block transformation of filter bank log energies. We have evaluated existing and proposed features against several kinds of synthetic speech data from ASVspoof 2015 corpora. The results show that the proposed techniques outperform existing approaches for various spoofing attack detection task. The techniques investigated in this paper can also accurately classify natural and synthetic speech as equal error rates (EERs) of 0% have been achieved.
AIAug 21, 2015
Recurrent Neural Network Based Modeling of Gene Regulatory Network Using Bat AlgorithmSudip Mandal, Goutam Saha, Rajat K. Pal
Correct inference of genetic regulations inside a cell is one of the greatest challenges in post genomic era for the biologist and researchers. Several intelligent techniques and models were already proposed to identify the regulatory relations among genes from the biological database like time series microarray data. Recurrent Neural Network (RNN) is one of the most popular and simple approach to model the dynamics as well as to infer correct dependencies among genes. In this paper, Bat Algorithm (BA) is applied to optimize the model parameters of RNN model of Gene Regulatory Network (GRN). Initially the proposed method is tested against small artificial network without any noise and the efficiency is observed in term of number of iteration, number of population and BA optimization parameters. The model is also validated in presence of different level of random noise for the small artificial network and that proved its ability to infer the correct inferences in presence of noise like real world dataset. In the next phase of this research, BA based RNN is applied to real world benchmark time series microarray dataset of E. coli. The results prove that it can able to identify the maximum number of true positive regulation but also include some false positive regulations. Therefore, BA is very suitable for identifying biological plausible GRN with the help RNN model.
MMOct 1, 2012
Comparison of Speech Activity Detection Techniques for Speaker RecognitionMd. Sahidullah, Goutam Saha
Speech activity detection (SAD) is an essential component for a variety of speech processing applications. It has been observed that performances of various speech based tasks are very much dependent on the efficiency of the SAD. In this paper, we have systematically reviewed some popular SAD techniques and their applications in speaker recognition. Speaker verification system using different SAD technique are experimentally evaluated on NIST speech corpora using Gaussian mixture model- universal background model (GMM-UBM) based classifier for clean and noisy conditions. It has been found that two Gaussian modeling based SAD is comparatively better than other SAD techniques for different types of noises.
CVJun 12, 2012
A Novel Windowing Technique for Efficient Computation of MFCC for Speaker RecognitionMd. Sahidullah, Goutam Saha
In this paper, we propose a novel family of windowing technique to compute Mel Frequency Cepstral Coefficient (MFCC) for automatic speaker recognition from speech. The proposed method is based on fundamental property of discrete time Fourier transform (DTFT) related to differentiation in frequency domain. Classical windowing scheme such as Hamming window is modified to obtain derivatives of discrete time Fourier transform coefficients. It has been mathematically shown that the slope and phase of power spectrum are inherently incorporated in newly computed cepstrum. Speaker recognition systems based on our proposed family of window functions are shown to attain substantial and consistent performance improvement over baseline single tapered Hamming window as well as recently proposed multitaper windowing technique.