SD · LG · AS · Oct 3, 2022

Efficient acoustic feature transformation in mismatched environments using a Guided-GAN

arXiv:2210.00721v3 · 1 citation · h-index: 22
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of deploying automatic speech recognition (ASR) systems in resource-scarce settings, where training data and computational power are limited. It offers an incremental improvement over existing methods such as multi-style training.

The paper tackles the problem of improving automatic speech recognition in mismatched environments with limited data by using a Guided-GAN to transform acoustic features, achieving relative word error rate reductions of 11.5% to 19.7% with less than one hour of data.
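The "relative" in these figures means the reduction is measured against the baseline WER, not in absolute percentage points. A minimal sketch of that arithmetic follows; the function name and the input values are hypothetical and chosen only for illustration, not taken from the paper's result tables.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Return the relative WER reduction, in percent of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# Hypothetical values: a baseline WER of 20.0% that drops to 17.7%
# is a 2.3-point absolute reduction, i.e. about 11.5% relative.
example = relative_wer_reduction(20.0, 17.7)
```

Under this definition, the same absolute gain counts for more when the baseline WER is already low.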

We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good-quality data and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.
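The guiding idea in the abstract, an adversarial loss plus an extra term from the baseline acoustic model, can be sketched for a single frame as below. This is an illustrative sketch, not the paper's implementation: the non-saturating adversarial term, the cross-entropy form of the guidance term, and the names `guide_weight` and `target_state` are all assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adversarial_loss(disc_prob_real):
    """Non-saturating generator loss, -log D(G(x)) (assumed form)."""
    return -math.log(disc_prob_real)


def guidance_loss(am_logits, target_state):
    """Cross-entropy of the frozen baseline acoustic model on a generated frame.

    am_logits: the baseline model's outputs for the transformed features.
    target_state: index of the frame's acoustic-state label (assumed available).
    """
    return -math.log(softmax(am_logits)[target_state])


def guided_generator_loss(disc_prob_real, am_logits, target_state, guide_weight=1.0):
    """Adversarial term plus a weighted guidance term (hypothetical weighting)."""
    return adversarial_loss(disc_prob_real) + guide_weight * guidance_loss(
        am_logits, target_state
    )


# The loss falls when the baseline classifies the generated frame correctly.
well_classified = guided_generator_loss(0.5, [2.0, 0.0], target_state=0)
poorly_classified = guided_generator_loss(0.5, [0.0, 2.0], target_state=0)
```

Because the guidance term depends only on the generated features and a label for the same frame, a sketch like this needs no paired clean and mismatched recordings, which is consistent with the abstract's claim about non-parallel data.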

