AS SD Jul 4, 2021

EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee

arXiv:2107.01554v216.454 citations h-index: 90 Has Code

Originality: Incremental advance

AI Analysis

EditSpeech addresses the need for high-quality speech editing tools in applications such as audio production and voice assistants. The advance is incremental because it builds on existing neural text-to-speech frameworks.

The paper tackles the problem of editing speech by deleting, inserting, or replacing words without degrading quality, and results show that EditSpeech outperforms baselines with lower spectral distortion and better speech quality in English and Chinese multi-speaker scenarios.

This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to delete, insert and replace words in a given speech utterance without causing audible degradation in speech quality and naturalness. The EditSpeech system is built on a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to incorporate the contextual information around the edited region and to achieve smooth transitions at both the left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms several baseline systems, with lower spectral distortion and higher listener preference for speech quality. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/EditSpeech/
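The abstract names bidirectional fusion but does not say how the two directions are combined. As a rough illustration only, the sketch below assumes that the edited region is predicted twice, once left-to-right and once right-to-left, and that the two predictions are joined at the frame where they agree most closely. The function names, the squared-Euclidean distance, and the toy frames are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Tuple

Frame = List[float]  # one spectral frame, e.g. a column of a mel-spectrogram


def frame_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two frames (an assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuse_bidirectional(
    forward: List[Frame], backward: List[Frame]
) -> Tuple[List[Frame], int]:
    """Join two predictions of the same edited region at their closest frame.

    forward  : frames predicted left-to-right (anchored on the left context)
    backward : frames for the same region predicted right-to-left, already
               reversed into chronological order (anchored on the right context)

    Returns the fused frames and the fusion index t. Frames before t come
    from `forward`; frames from t onward come from `backward`.
    """
    if len(forward) != len(backward) or not forward:
        raise ValueError("both predictions must be non-empty and equally long")
    # Pick the index where the two predictions are most similar.
    t = min(
        range(len(forward)),
        key=lambda i: frame_distance(forward[i], backward[i]),
    )
    return forward[:t] + backward[t:], t


# Toy two-dimensional frames, invented purely for illustration.
fwd = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
bwd = [[4.0, 4.0], [3.0, 3.0], [2.1, 2.0], [3.0, 3.0]]
fused, t = fuse_bidirectional(fwd, bwd)
print(t)      # 2
print(fused)  # [[0.0, 0.0], [1.0, 1.0], [2.1, 2.0], [3.0, 3.0]]
```

In this toy case the forward prediction supplies the frames next to the left boundary and the backward prediction supplies those next to the right boundary, which is one plausible way to get a smooth transition on both sides of the edit.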

