Emotional speech synthesis with rich and granularized control
This work addresses the need for more expressive and controllable emotional speech synthesis. The contribution is incremental: it builds on existing TTS methods and adds specific algorithmic improvements.
The paper tackles emotion control in text-to-speech synthesis by proposing a method that flexibly adjusts both emotion category and intensity. In subjective evaluations, the method outperformed conventional methods in emotional expressiveness and controllability.
This paper proposes an effective emotion control method for an end-to-end text-to-speech (TTS) system. To flexibly control the distinct characteristics of a target emotion category, it is essential to determine embedding vectors representing the TTS input. We apply an inter-to-intra emotional distance ratio algorithm to the embedding vectors, which minimizes the distance to the target emotion category while maximizing the distance to the other emotion categories. To further enhance the expressiveness of the target speech, we also introduce an effective interpolation technique that enables the intensity of a target emotion to be gradually changed to that of neutral speech. Subjective evaluation results in terms of emotional expressiveness and controllability show the superiority of the proposed algorithm over conventional methods.
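The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: Euclidean distance, a ratio defined as the mean distance to other-category embeddings divided by the mean distance to same-category embeddings, selection of the candidate with the highest ratio, and plain linear interpolation toward a neutral embedding. The function names (`inter_to_intra_ratio`, `select_embedding`, `interpolate`) and the toy 2-D vectors are hypothetical and are not taken from the paper.

```python
import math


def mean_distance(vector, others):
    """Mean Euclidean distance from `vector` to each vector in `others`."""
    if not others:
        raise ValueError("need at least one vector to compare against")
    return sum(math.dist(vector, o) for o in others) / len(others)


def inter_to_intra_ratio(candidate, same_category, other_categories):
    """Assumed ratio: mean inter-category distance / mean intra-category distance.

    A high value means the candidate is close to its own category
    and far from every other category.
    """
    intra = mean_distance(candidate, same_category)
    inter = mean_distance(candidate, [v for g in other_categories for v in g])
    return inter / intra if intra > 0 else math.inf


def select_embedding(embeddings, target):
    """Pick the target-category embedding with the highest assumed ratio.

    `embeddings` maps a category name to a list of vectors; the target
    category needs at least two vectors so that an intra-category
    distance exists for each candidate.
    """
    group = embeddings[target]
    others = [g for name, g in embeddings.items() if name != target]
    best, best_score = None, -math.inf
    for i, candidate in enumerate(group):
        rest = group[:i] + group[i + 1:]  # exclude the candidate itself
        score = inter_to_intra_ratio(candidate, rest, others)
        if score > best_score:
            best, best_score = candidate, score
    return best


def interpolate(neutral, emotion, alpha):
    """Linear interpolation: alpha=0 gives neutral, alpha=1 gives full emotion."""
    return [(1 - alpha) * n + alpha * e for n, e in zip(neutral, emotion)]


# Toy 2-D embeddings, invented purely for illustration.
toy_embeddings = {
    "happy": [[1.0, 1.0], [1.2, 0.9], [0.4, 0.3]],
    "neutral": [[0.0, 0.0], [0.1, -0.1]],
}
happy_vector = select_embedding(toy_embeddings, "happy")
half_intensity = interpolate([0.0, 0.0], happy_vector, 0.5)
```

In this sketch, `alpha` plays the role of the emotion intensity: sweeping it from 1 down to 0 moves the conditioning vector gradually from the selected emotion embedding to the neutral one. A real system would operate on learned, high-dimensional embeddings and might use a different distance measure or interpolation scheme.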