SD · LG · AS · May 9, 2023

Enhancing Gappy Speech Audio Signals with Generative Adversarial Networks

arXiv:2305.05780v1
Originality: Incremental advance
AI Analysis

This paper addresses audio gaps in speech, a nuisance for applications that require real-time or high-quality audio processing. The advance is incremental: it adapts existing image in-painting methods to audio.

The paper tackles the regeneration of gaps of up to 320 ms in speech audio by converting the audio into Mel-spectrograms and in-painting them with GANs. For 240 ms gaps it achieves a mean opinion score of 3.737, which listeners perceive as close to uninterrupted speech.

Gaps, dropouts, and short clips of corrupted audio are a common problem and are particularly annoying when they occur in speech. This paper uses machine learning to regenerate gaps of up to 320 ms in a speech audio signal. Audio regeneration is translated into image regeneration by transforming the audio into a Mel-spectrogram and using image in-painting to regenerate the gaps. The full Mel-spectrogram is then converted back to audio using the Parallel-WaveGAN vocoder and integrated into the audio stream. Using a sample of 1300 spoken audio clips of between 1 and 10 seconds, taken from the publicly available LJSpeech dataset, our results show regeneration of audio gaps in close to real time using GANs on a GPU-equipped system. As expected, the smaller the gap in the audio, the better the quality of the filled gap. On a gap of 240 ms, the average mean opinion score (MOS) for the best-performing models was 3.737 on a scale of 1 (worst) to 5 (best), which is sufficient for a human to perceive the result as close to uninterrupted human speech.
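To make the audio-to-image translation concrete, the sketch below shows how a gap measured in milliseconds maps onto a block of Mel-spectrogram columns that an in-painting model would then have to fill. It is only an illustration under stated assumptions, not the authors' code: the hop length of 256 samples and the 80 Mel bins are assumed values that this summary does not state, the 22050 Hz sample rate is the LJSpeech rate, and `gap_to_frames` and `mask_gap` are hypothetical helper names.

```python
import math


def gap_to_frames(gap_ms, sample_rate=22050, hop_length=256):
    """Number of spectrogram frames needed to cover a gap of gap_ms milliseconds.

    sample_rate is the LJSpeech rate; hop_length is an assumed value.
    """
    return math.ceil(gap_ms * sample_rate / (1000 * hop_length))


def mask_gap(mel, start_frame, n_frames):
    """Zero out a block of frames in a Mel-spectrogram.

    mel is a list of rows with shape [n_mels][n_frames_total].
    Returns (masked_mel, mask), where mask[t] == 1 marks a frame to regenerate.
    """
    total = len(mel[0])
    end_frame = min(start_frame + n_frames, total)
    mask = [1 if start_frame <= t < end_frame else 0 for t in range(total)]
    masked = [[0.0 if mask[t] else row[t] for t in range(total)] for row in mel]
    return masked, mask


# A 240 ms gap spans 21 frames under the assumed settings.
frames_240 = gap_to_frames(240)

# Toy spectrogram: 80 Mel bins by 100 frames, all ones (placeholder data).
toy_mel = [[1.0] * 100 for _ in range(80)]
masked_mel, gap_mask = mask_gap(toy_mel, start_frame=40, n_frames=frames_240)
print(frames_240, sum(gap_mask))
```

Under these assumptions the in-painting region is only a few dozen columns wide, which is consistent with the abstract's observation that shorter gaps are easier to fill well. The masked columns are what a GAN would regenerate before the vocoder converts the spectrogram back to audio.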

