AS, CL, SD | Dec 18, 2019

A Cycle-GAN Approach to Model Natural Perturbations in Speech for ASR Applications

arXiv:1912.11151v1 | 1 citation
Originality: Synthesis-oriented
AI Analysis

The paper addresses robustness issues in ASR systems for real-world applications where speakers exhibit emotional or physical variations. The contribution is incremental, however, because it applies an existing method (CycleGAN) to a specific domain.

Natural perturbations in speech, such as laughter or creaky voice, degrade ASR performance. The paper proposes a CycleGAN-based front-end that transforms perturbed speech into normal speech, which improves the performance of four ASR systems on spontaneous laughter-speech and creaky-speech datasets.

Naturally introduced perturbations in the audio signal, caused by the emotional and physical states of the speaker, can significantly degrade the performance of Automatic Speech Recognition (ASR) systems. In this paper, we propose a front-end based on a Cycle-Consistent Generative Adversarial Network (CycleGAN) which transforms naturally perturbed speech into normal speech, and hence improves the robustness of an ASR system. The CycleGAN model is trained on non-parallel examples of perturbed and normal speech. Experiments on spontaneous laughter-speech and creaky-speech datasets show that the performance of four different ASR systems improves when using speech obtained from the CycleGAN-based front-end, as compared to directly using the original perturbed speech. Visualization of the features of the laughter-perturbed speech and those generated by the proposed front-end further demonstrates the effectiveness of our approach.
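
Training on non-parallel examples works because the general CycleGAN formulation includes a cycle-consistency term: mapping a sample to the other domain and back should reproduce the original sample. The sketch below illustrates only that term, on toy feature vectors. It is not the paper's implementation. The function names, the list-of-floats feature representation, and the shift-based "generators" are all hypothetical stand-ins, and the adversarial losses that a full CycleGAN also needs are omitted.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # hypothetical stand-in for one speech feature vector
Generator = Callable[[Frame], Frame]


def l1_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    perturbed: Sequence[Frame],
    normal: Sequence[Frame],
    to_normal: Generator,
    to_perturbed: Generator,
) -> float:
    """L1 cycle loss: perturbed -> normal -> perturbed, and normal -> perturbed -> normal.

    The two sets are never paired with each other, so they may differ in size.
    """
    forward = sum(
        l1_distance(to_perturbed(to_normal(x)), x) for x in perturbed
    ) / len(perturbed)
    backward = sum(
        l1_distance(to_normal(to_perturbed(y)), y) for y in normal
    ) / len(normal)
    return forward + backward


# Toy data: three "perturbed" frames and two "normal" frames (non-parallel).
perturbed_set = [[1.5, 2.0], [0.5, 3.0], [2.5, 1.0]]
normal_set = [[0.5, 1.0], [1.5, 0.0]]

# Toy generators (assumption): the perturbation is a constant shift of 1.0.
remove_shift = lambda v: [x - 1.0 for x in v]
add_shift = lambda v: [x + 1.0 for x in v]
add_wrong_shift = lambda v: [x + 1.5 for x in v]  # not the inverse mapping

loss_consistent = cycle_consistency_loss(
    perturbed_set, normal_set, remove_shift, add_shift
)
loss_inconsistent = cycle_consistency_loss(
    perturbed_set, normal_set, remove_shift, add_wrong_shift
)
```

With mutually inverse mappings the loss is zero, while the mismatched pair is penalized. In a real system the two generators would be neural networks trained jointly with discriminators for each domain.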

