SD · LG · AS · Feb 15, 2022

SpeechPainter: Text-conditioned Speech Inpainting

arXiv:2202.07273v2 · 36 citations
AI Analysis

This paper addresses speech inpainting for audio editing applications, though it appears incremental because it builds on existing text-to-speech and inpainting techniques.

The paper tackles the problem of filling gaps of up to one second in speech samples with a text-conditioned model called SpeechPainter, which maintains speaker identity and prosody while outperforming adaptive TTS baselines in human side-by-side preference and MOS tests.

We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
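To make the task setup concrete, the following is a minimal sketch of how an inpainting input could be prepared: a contiguous gap of at most one second is removed from a waveform and paired with a transcript. The abstract does not say how the gap is represented, so the zero-filling, the `mask_gap` helper, the 16 kHz sample rate, and the `model_inputs` tuple are all illustrative assumptions rather than the authors' pipeline.

```python
import math

MAX_GAP_SECONDS = 1.0  # upper bound on gap length stated in the abstract


def mask_gap(samples, sample_rate, gap_start_s, gap_duration_s):
    """Return (masked_samples, gap_mask) with one contiguous gap zeroed out.

    Zero-filling is an illustrative assumption, not the paper's documented
    input format.
    """
    if not 0.0 < gap_duration_s <= MAX_GAP_SECONDS:
        raise ValueError("gap must be longer than 0 s and at most 1 s")
    start = int(round(gap_start_s * sample_rate))
    if start < 0 or start >= len(samples):
        raise ValueError("gap must start inside the signal")
    end = min(len(samples), start + int(round(gap_duration_s * sample_rate)))
    # True marks samples that a model would be asked to fill in.
    gap_mask = [start <= i < end for i in range(len(samples))]
    masked = [0.0 if in_gap else x for x, in_gap in zip(samples, gap_mask)]
    return masked, gap_mask


# Hypothetical stand-in for speech: 2 s of a 220 Hz tone at 16 kHz.
sample_rate = 16000
speech = [math.sin(2 * math.pi * 220 * n / sample_rate)
          for n in range(2 * sample_rate)]
transcript = "placeholder transcript"  # the auxiliary textual input

# Remove 0.5 s starting at 0.75 s, then bundle the conditioning inputs.
masked, gap_mask = mask_gap(speech, sample_rate, 0.75, 0.5)
model_inputs = (masked, gap_mask, transcript)
```

An inpainting model would then be expected to regenerate only the samples flagged in `gap_mask`, using the surrounding audio for speaker identity, prosody and recording conditions, and the transcript for content.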
