SD LG AS

Feb 15, 2022

SpeechPainter: Text-conditioned Speech Inpainting

Zalán Borsos, Matt Sharifi, Marco Tagliasacchi

arXiv:2202.07273v2 15.136 citations

Originality: Incremental advance

AI Analysis

The paper addresses speech inpainting for audio editing applications, though it appears incremental, as it builds on existing text-to-speech and inpainting techniques.

The paper tackles the problem of filling gaps of up to one second in speech samples with a text-conditioned model called SpeechPainter, which maintains speaker identity and prosody and outperforms adaptive TTS baselines in human side-by-side preference and mean opinion score (MOS) tests.

We propose SpeechPainter, a model for filling in gaps of up to one second in speech samples by leveraging an auxiliary textual input. We demonstrate that the model performs speech inpainting with the appropriate content, while maintaining speaker identity, prosody and recording environment conditions, and generalizing to unseen speakers. Our approach significantly outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
