In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data
This work addresses the challenge of efficiently creating multiple speech styles for applications such as broadcasting, though the contribution is incremental because it builds on existing neural TTS methods.
The paper tackles the problem of synthesizing newscaster-style speech with limited data by proposing a bi-style text-to-speech model that uses a one-hot vector to factorize the neutral and newscaster styles. The model reduces the gap in perceived style-appropriateness between natural newscaster recordings and neutral synthesis by approximately two-thirds.
Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech; however, they require a large quantity of training data. This makes creating models for multiple styles expensive and time-consuming. In this paper, different styles of speech are analysed based on prosodic variations, and from this analysis a model is proposed to synthesise speech in the style of a newscaster with just a few hours of supplementary data. We pose the problem of synthesising in a target style using limited data as that of creating a bi-style model that can synthesise both neutral-style and newscaster-style speech via a one-hot vector which factorises the two styles. We also propose conditioning the model on contextual word embeddings, and extensively evaluate it against neutral NTTS and neutral concatenative-based synthesis. This model closes the gap in perceived style-appropriateness between natural recordings of newscaster-style speech and neutral speech synthesis by approximately two-thirds.
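The one-hot style factorisation described in the abstract can be illustrated with a minimal sketch. The abstract does not say where the style vector enters the network, so appending it to every encoder frame is an assumption made here for illustration. The names `style_one_hot` and `condition_on_style`, and the toy frame values, are hypothetical and not taken from the paper.

```python
from typing import List

# The two styles named in the abstract; the index order is an assumption.
STYLES = ("neutral", "newscaster")


def style_one_hot(style: str) -> List[float]:
    """Return a one-hot vector that selects one of the two styles."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    return [1.0 if s == style else 0.0 for s in STYLES]


def condition_on_style(encoder_frames: List[List[float]], style: str) -> List[List[float]]:
    """Append the style vector to every encoder frame (assumed injection point)."""
    code = style_one_hot(style)
    # Build new lists so the input frames are left unmodified.
    return [frame + code for frame in encoder_frames]


# Hypothetical encoder output: 3 time steps with 4 features each.
frames = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
neutral = condition_on_style(frames, "neutral")
newscaster = condition_on_style(frames, "newscaster")
```

With this layout, the same trained model would be asked for either style at synthesis time simply by switching the string passed to `condition_on_style`, while the text-derived features stay identical.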