Hema A. Murthy

CL
h-index34
7papers
41citations
Novelty49%
AI Score42

7 Papers

CLFeb 9
Challenges in Translating Technical Lectures: Insights from the NPTEL

Basudha Raje, Sadanand Venkatraman, Nandana TP et al.

This study examines the practical applications and methodological implications of Machine Translation in Indian Languages, specifically Bangla, Malayalam, and Telugu, within emerging translation workflows and in relation to existing evaluation frameworks. The choice of languages prioritized in this study is motivated by a triangulation of linguistic diversity, which illustrates the significance of multilingual accommodation of educational technology under NEP 2020. This is further supported by the largest MOOC portal, i.e., NPTEL, which has served as a corpus to facilitate the arguments presented in this paper. The curation of a spontaneous speech corpora that accounts for lucid delivery of technical concepts, considering the retention of suitable register and lexical choices are crucial in a diverse country like India. The findings of this study highlight metric-specific sensitivity and the challenges of morphologically rich and semantically compact features when tested against surface overlapping metrics.

CLJun 4, 2025
Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages

Utkarsh Pathak, Chandra Sai Krishna Gunda, Anusha Prakash et al.

Text-to-speech (TTS) systems typically require high-quality studio data and accurate transcriptions for training. India has 1369 languages, with 22 official using 13 scripts. Training a TTS system for all these languages, most of which have no digital resources, seems a Herculean task. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. The novelty of our work is in the augmentation of a shared phone representation and modifying the text parsing rules to match the phonotactics of the target language, thus reducing the synthesiser overhead and enabling rapid adaptation. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili and Kurukh by leveraging linguistic connections across languages with suitable synthesisers. Evaluations confirm the effectiveness of this approach, highlighting its potential to expand speech technology access for under-represented languages.

ASMar 4, 2021
Front-end Diarization for Percussion Separation in Taniavartanam of Carnatic Music Concerts

Nauman Dawalatabad, Jilt Sebastian, Jom Kuriakose et al.

Instrument separation in an ensemble is a challenging task. In this work, we address the problem of separating the percussive voices in the taniavartanam segments of Carnatic music. In taniavartanam, a number of percussive instruments play together or in tandem. Separation of instruments in regions where only one percussion is present leads to interference and artifacts at the output, as source separation algorithms assume the presence of multiple percussive voices throughout the audio segment. We prevent this by first subjecting the taniavartanam to diarization. This process results in homogeneous clusters consisting of segments of either a single voice or multiple voices. A cluster of segments with multiple voices is identified using the Gaussian mixture model (GMM), which is then subjected to source separation. A deep recurrent neural network (DRNN) based approach is used to separate the multiple instrument segments. The effectiveness of the proposed system is evaluated on a standard Carnatic music dataset. The proposed approach provides close-to-oracle performance for non-overlapping segments and a significant improvement over traditional separation schemes.

CVNov 14, 2020
Towards Zero-Shot Learning with Fewer Seen Class Examples

Vinay Kumar Verma, Ashish Mishra, Anubha Pandey et al.

We present a meta-learning based generative model for zero-shot learning (ZSL) towards a challenging setting when the number of training examples from each \emph{seen} class is very few. This setup contrasts with the conventional ZSL approaches, where training typically assumes the availability of a sufficiently large number of training examples from each of the seen classes. The proposed approach leverages meta-learning to train a deep generative model that integrates variational autoencoder and generative adversarial networks. We propose a novel task distribution where meta-train and meta-validation classes are disjoint to simulate the ZSL behaviour in training. Once trained, the model can generate synthetic examples from seen and unseen classes. Synthesize samples can then be used to train the ZSL framework in a supervised manner. The meta-learner enables our model to generates high-fidelity samples using only a small number of training examples from seen classes. We conduct extensive experiments and ablation studies on four benchmark datasets of ZSL and observe that the proposed model outperforms state-of-the-art approaches by a significant margin when the number of examples per seen class is very small.

CVJan 18, 2020
Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval

Anubha Pandey, Ashish Mishra, Vinay Kumar Verma et al.

Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach based on the Stacked Adversarial Network (SAN) and the advantage of Siamese Network (SN) for ZS-SBIR. While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to that of an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on TU-Berlin, and Sketchy database in both standard ZSL and generalized ZSL setting. The proposed method yields a significant improvement in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL) for SBIR.

ASApr 16, 2019
Spoof detection using time-delay shallow neural network and feature switching

Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M et al.

Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspired by the state-of-the-art \emph{x}-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is that it can handle variable length utterances during testing. Performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASV-spoof-2019 dataset. The performance of the systems is measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, GMM surpasses TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data with a relative improvement of 48.03\% and 49.47\% for both logical and physical access, respectively.

SDMar 17, 2016
Modified Group Delay Based MultiPitch Estimation in Co-Channel Speech

Rajeev Rajan, Hema A. Murthy

Phase processing has been replaced by group delay processing for the extraction of source and system parameters from speech. Group delay functions are ill-behaved when the transfer function has zeros that are close to unit circle in the z-domain. The modified group delay function addresses this problem and has been successfully used for formant and monopitch estimation. In this paper, modified group delay functions are used for multipitch estimation in concurrent speech. The power spectrum of the speech is first flattened in order to annihilate the system characteristics, while retaining the source characteristics. Group delay analysis on this flattened spectrum picks the predominant pitch in the first pass and a comb filter is used to filter out the estimated pitch along with its harmonics. The residual spectrum is again analyzed for the next candidate pitch estimate in the second pass. The final pitch trajectories of the constituent speech utterances are formed using pitch grouping and post processing techniques. The performance of the proposed algorithm was evaluated on standard datasets using two metrics; pitch accuracy and standard deviation of fine pitch error. Our results show that the proposed algorithm is a promising pitch detection method in multipitch environment for real speech recordings.