PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS
This work addresses limited expressiveness in speech synthesis for applications that require natural and controllable voice generation; it represents an incremental improvement over existing methods.
The paper tackles low variance in speech synthesized by pitch-controllable TTS models. It proposes PITS, which models pitch with variational inference rather than with fundamental frequency, and reports high-quality speech that is indistinguishable from ground truth along with high pitch-controllability.
Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code, audio samples, and demo are available at https://github.com/anonymous-pits/pits.
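The abstract names the Yingram but does not define it. As a rough illustration only, the sketch below assumes the Yingram builds on the cumulative mean normalized difference function of the YIN pitch estimator, read out at lags that correspond to MIDI note numbers. That assumption is not stated in the abstract. The function names `cmndf` and `midi_to_lag`, the 16 kHz sample rate, and the 200 Hz test tone are hypothetical choices for this example, and none of it is taken from the PITS code.

```python
import numpy as np


def cmndf(frame, max_lag):
    """Cumulative mean normalized difference function, as defined in YIN.

    Assumption: this is the quantity a Yingram-style feature is derived from.
    """
    frame = np.asarray(frame, dtype=np.float64)
    window = len(frame) - max_lag  # number of samples compared at every lag
    if window <= 0:
        raise ValueError("frame must be longer than max_lag")

    # Squared difference between the frame and a lagged copy of itself.
    diff = np.zeros(max_lag + 1)
    for lag in range(1, max_lag + 1):
        delta = frame[:window] - frame[lag:lag + window]
        diff[lag] = np.sum(delta ** 2)

    # Normalize each lag by the running mean of the differences up to it.
    out = np.ones(max_lag + 1)  # the value at lag 0 is defined as 1
    running_sum = np.cumsum(diff[1:])
    lags = np.arange(1, max_lag + 1)
    out[1:] = diff[1:] * lags / np.maximum(running_sum, 1e-12)
    return out


def midi_to_lag(midi, sample_rate):
    """Lag in samples of one period of a MIDI note (A4 = note 69 = 440 Hz)."""
    freq = 440.0 * 2.0 ** ((midi - 69) / 12.0)
    return sample_rate / freq


# Hypothetical test signal: a pure 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(1024)
frame = np.sin(2 * np.pi * 200.0 * t / sample_rate)

curve = cmndf(frame, max_lag=120)
best_lag = int(np.argmin(curve[20:])) + 20  # skip very short lags
```

For this tone one period spans 16000 / 200 = 80 samples, so the curve dips at lag 80. A feature indexed by semitone rather than by raw lag can be shifted along that axis to emulate a pitch shift without an explicit fundamental frequency estimate, which is the general idea the abstract alludes to.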