Representation Mixing for TTS Synthesis
This work addresses the lack of direct pronunciation control in deployed TTS systems, offering a practical solution for speech synthesis applications.
TTS systems usually force a choice between character and phoneme inputs. The paper introduces representation mixing, a method that combines multiple linguistic representations in a single encoder so that character, phoneme, or mixed input can be chosen at inference time. Experiments and user studies on a public audiobook corpus show that the approach is effective.
Recent character- and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character and phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method, named representation mixing, for combining multiple types of linguistic information in a single encoder, enabling a flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
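As a rough illustration of what mixed input to a single encoder could look like, the sketch below builds one token sequence in which each word is spelled either as characters or as phonemes, together with a per-token mask recording which was used. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: the toy lexicon, the function names `mix_tokens`, `make_table`, and `embed`, and the choice to add a learned mask embedding to the symbol embedding. Random vectors stand in for learned embeddings.

```python
import random

# Hypothetical toy lexicon; a real system would use a pronunciation dictionary.
TOY_LEXICON = {
    "read": ["R", "EH", "D"],
    "the": ["DH", "AH"],
}


def mix_tokens(words, phoneme_words, lexicon):
    """Return (tokens, mask); mask is 1 for phoneme tokens, 0 for characters."""
    tokens, mask = [], []
    for i, word in enumerate(words):
        if i > 0:
            # Word boundaries are always encoded as a character-level space.
            tokens.append(" ")
            mask.append(0)
        if word in phoneme_words and word in lexicon:
            symbols, flag = lexicon[word], 1
        else:
            # Fall back to characters when no pronunciation is requested/known.
            symbols, flag = list(word), 0
        tokens.extend(symbols)
        mask.extend([flag] * len(symbols))
    return tokens, mask


def make_table(symbols, dim, rng):
    """Random stand-in for a learned embedding table."""
    return {s: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for s in symbols}


def embed(tokens, mask, char_table, phone_table, mask_table):
    """Pick the character or phoneme embedding per token, then add a mask embedding."""
    vectors = []
    for tok, m in zip(tokens, mask):
        base = phone_table[tok] if m else char_table[tok]
        vectors.append([b + t for b, t in zip(base, mask_table[m])])
    return vectors


if __name__ == "__main__":
    rng = random.Random(0)
    char_table = make_table(list("abcdefghijklmnopqrstuvwxyz "), 4, rng)
    phone_table = make_table(["R", "EH", "D", "DH", "AH"], 4, rng)
    mask_table = make_table([0, 1], 4, rng)

    # Force the past-tense pronunciation of "read"; leave other words as text.
    tokens, mask = mix_tokens(["i", "read", "the", "book"], {"read"}, TOY_LEXICON)
    vectors = embed(tokens, mask, char_table, phone_table, mask_table)
    print(tokens)
    print(mask)
    print(len(vectors), "vectors of size", len(vectors[0]))
```

Passing an empty `phoneme_words` set yields pure character input, and listing every word that the lexicon covers yields phoneme input, so the same encoder input path serves all three modes under these assumptions.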