SD, LG, AS · Mar 3, 2023

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

arXiv:2303.01664v2 · 49 citations · h-index: 51
AI Analysis

This work addresses the challenge of generating high-quality training data for speech generation from degraded web sources, offering a practical solution for improving text-to-speech systems.

The authors address speech restoration, the conversion of degraded speech into high-quality signals. They propose Miipher, which uses self-supervised speech and text representations to remain robust against various degradations and makes it possible to train high-quality text-to-speech models from web-collected speech samples.

Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
