AS | AI | SP | Jun 9, 2024

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

arXiv:2406.05699v1 | 6 citations
Originality: Incremental advance
AI Analysis

This paper addresses a practical issue for users of zero-shot TTS systems by improving robustness to noisy audio prompts, though the advance is incremental because it builds on existing flow-matching methods.

The paper tackles noise robustness in zero-shot text-to-speech systems, where output quality degrades when the audio prompt is noisy. It finds that training strategies such as unsupervised pre-training and fine-tuning with random noise mixing significantly improve intelligibility, speaker similarity, and overall quality compared to applying speech enhancement to the audio prompt.

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
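The "fine-tuning with random noise mixing" step mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function names `mix_noise_at_snr` and `random_noise_mix`, the SNR range, and the mixing probability are hypothetical and are not taken from the paper.

```python
import numpy as np


def mix_noise_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    # Tile the noise if it is shorter than the speech, then crop to equal length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        # SNR is undefined for silent inputs; return the speech unchanged.
        return speech.copy()

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the target noise power.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


def random_noise_mix(speech, noise, rng, snr_range_db=(0.0, 20.0), mix_prob=0.5):
    """With probability `mix_prob`, corrupt `speech` at a uniformly sampled SNR.

    The default SNR range and probability are placeholders, not the paper's values.
    Returns the (possibly noisy) waveform and the sampled SNR, or None if left clean.
    """
    if rng.random() >= mix_prob:
        return speech.copy(), None
    snr_db = rng.uniform(*snr_range_db)
    return mix_noise_at_snr(speech, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = 0.1 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for a speech prompt
    babble = rng.standard_normal(4000)            # stand-in for a noise clip
    noisy, snr = random_noise_mix(clean, babble, rng, mix_prob=1.0)
    print("sampled SNR (dB):", snr)
```

In a fine-tuning loop, a function like `random_noise_mix` would typically be applied to the audio prompt on the fly, so the model sees both clean and noisy prompts during training; how the paper actually schedules or samples the noise is not specified in the abstract.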

