Context-Aware Prosody Correction for Text-Based Speech Editing
This work addresses prosody mismatches in speech editing tools. The contribution is incremental: it builds on existing methods to improve the user experience of audio editing.
The paper tackles unnatural prosody in text-based speech editing by proposing a context-aware method that uses neural networks to generate prosody features and then applies signal manipulation driven by those features. Subjective listening tests indicate improved naturalness.
Text-based speech editors expedite the process of editing speech recordings by permitting editing via intuitive cut, copy, and paste operations on a speech transcript. A major drawback of current systems, however, is that edited recordings often sound unnatural because of prosody mismatches around edited regions. In our work, we propose a new context-aware method for more natural-sounding text-based editing of speech. To do so, we 1) use a series of neural networks to generate salient prosody features that depend on the prosody of the speech surrounding the edit and are amenable to fine-grained user control; 2) use the generated features to control a standard pitch-shift and time-stretch method; and 3) apply a denoising neural network to remove artifacts induced by the signal manipulation, yielding a high-fidelity result. We evaluate our approach using a subjective listening test, provide a detailed comparative analysis, and draw several insights from the results.
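To make step 2 of the pipeline concrete, the sketch below shows one plausible way that generated prosody features could be turned into control parameters for a pitch-shift and time-stretch routine. It is an illustration under stated assumptions, not the paper's implementation: the function names `stretch_ratios` and `pitch_shift_semitones`, the choice of per-segment durations and per-frame F0 as the prosody features, and all numeric values are hypothetical.

```python
import math


def stretch_ratios(source_durations, target_durations):
    """Per-segment time-stretch factors; a value above 1 lengthens the segment.

    Assumption: durations are in seconds, one entry per segment (e.g. phoneme).
    """
    if len(source_durations) != len(target_durations):
        raise ValueError("duration lists must have the same length")
    if any(d <= 0 for d in source_durations):
        raise ValueError("source durations must be positive")
    return [tgt / src for src, tgt in zip(source_durations, target_durations)]


def pitch_shift_semitones(source_f0_hz, target_f0_hz):
    """Per-frame pitch shift in semitones.

    Assumption: unvoiced frames are marked with F0 <= 0 and are left unshifted.
    """
    if len(source_f0_hz) != len(target_f0_hz):
        raise ValueError("F0 tracks must have the same length")
    shifts = []
    for src, tgt in zip(source_f0_hz, target_f0_hz):
        if src > 0 and tgt > 0:
            # 12 semitones per octave; an octave is a doubling of frequency.
            shifts.append(12.0 * math.log2(tgt / src))
        else:
            shifts.append(0.0)
    return shifts


# Hypothetical values: measured prosody of the edited region versus
# targets that a prosody model might generate from the surrounding context.
ratios = stretch_ratios([0.10, 0.20], [0.15, 0.10])
shifts = pitch_shift_semitones([110.0, 0.0, 220.0], [220.0, 150.0, 220.0])
```

In this sketch the ratios and semitone offsets would be passed to whichever pitch-shift and time-stretch method is in use. Any artifacts that manipulation introduces are what the denoising stage in step 3 is meant to remove.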