Mahmoud Al Ismail

5papers

159citations

Novelty40%

AI Score24

Ranked #178,341 of 205,806 authors (top 87%)#1,139 in AS (top 82%)

5 Papers

ASAug 11, 2023

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

Mohammad Soleymanpour, Mahmoud Al Ismail, Fahimeh Bahmaninezhad et al.

We introduce a bilingual solution to support English as secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently bilingual streaming transformer model, (c) a parallel encoder structure with language identification (LID) loss, (d) parallel encoder with an auxiliary loss for monolingual projections. We conclude that in comparison to LID loss, our proposed auxiliary loss is superior in specializing the parallel encoders to respective monolingual locales, and that contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) for a code-mix IT task from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the monolingual IT model (9.5%) over IT tests.

ASOct 29, 2020

Interpreting glottal flow dynamics for detecting COVID-19 from voice

Soham Deshmukh, Mahmoud Al Ismail, Rita Singh

In the pathogenesis of COVID-19, impairment of respiratory functions is often one of the key symptoms. Studies show that in these cases, voice production is also adversely affected -- vocal fold oscillations are asynchronous, asymmetrical and more restricted during phonation. This paper proposes a method that analyzes the differential dynamics of the glottal flow waveform (GFW) during voice production to identify features in them that are most significant for the detection of COVID-19 from voice. Since it is hard to measure this directly in COVID-19 patients, we infer it from recorded speech signals and compare it to the GFW computed from physical model of phonation. For normal voices, the difference between the two should be minimal, since physical models are constructed to explain phonation under assumptions of normalcy. Greater differences implicate anomalies in the bio-physical factors that contribute to the correctness of the physical model, revealing their significance indirectly. Our proposed method uses a CNN-based 2-step attention model that locates anomalies in time-feature space in the difference of the two GFWs, allowing us to infer their potential as discriminative features for classification. The viability of this method is demonstrated using a clinically curated dataset of COVID-19 positive and negative subjects.

ASOct 21, 2020

Detection of COVID-19 through the analysis of vocal fold oscillations

Mahmoud Al Ismail, Soham Deshmukh, Rita Singh

Phonation, or the vibration of the vocal folds, is the primary source of vocalization in the production of voiced sounds by humans. It is a complex bio-mechanical process that is highly sensitive to changes in the speaker's respiratory parameters. Since most symptomatic cases of COVID-19 present with moderate to severe impairment of respiratory functions, we hypothesize that signatures of COVID-19 may be observable by examining the vibrations of the vocal folds. Our goal is to validate this hypothesis, and to quantitatively characterize the changes observed to enable the detection of COVID-19 from voice. For this, we use a dynamical system model for the oscillation of the vocal folds, and solve it using our recently developed ADLES algorithm to yield vocal fold oscillation patterns directly from recorded speech. Experimental results on a clinically curated dataset of COVID-19 positive and negative subjects reveal characteristic patterns of vocal fold oscillations that are correlated with COVID-19. We show that these are prominent and discriminative enough that even simple classifiers such as logistic regression yields high detection accuracies using just the recordings of isolated extended vowels.

CVJul 12, 2018

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

Yandong Wen, Mahmoud Al Ismail, Weiyang Liu et al.

We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Different from the existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, DIMNet learns a shared representation for different modalities by mapping them individually to their common covariates. These shared representations can then be used to find the correspondences between the modalities. We show empirically that DIMNet is able to achieve better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.

CVJul 12, 2018

Optimal Strategies for Matching and Retrieval Problems by Comparing Covariates

Yandong Wen, Mahmoud Al Ismail, Bhiksha Raj et al.

In many retrieval problems, where we must retrieve one or more entries from a gallery in response to a probe, it is common practice to learn to do by directly comparing the probe and gallery entries to one another. In many situations the gallery and probe have common covariates -- external variables that are common to both. In principle it is possible to perform the retrieval based merely on these covariates. The process, however, becomes gated by our ability to recognize the covariates for the probe and gallery entries correctly. In this paper we analyze optimal strategies for retrieval based only on matching covariates, when the recognition of the covariates is itself inaccurate. We investigate multiple problems: recovering one item from a gallery of $N$ entries, matching pairs of instances, and retrieval from large collections. We verify our analytical formulae through experiments to verify their correctness in practical settings.