Environment Aware Text-to-Speech Synthesis
This work addresses the need for TTS systems that adapt to varied acoustic settings, though the contribution is incremental, since it builds on existing neural TTS methods.
The study tackles the problem of generating speech that adapts to specific acoustic environments by modeling the environment as a condition in neural TTS, and it demonstrates effective disentanglement of speaker and environment factors in synthesized speech.
This study aims to design an environment-aware text-to-speech (TTS) system that can generate speech to suit specific acoustic environments. It is also motivated by the desire to leverage large amounts of speech audio from heterogeneous sources in TTS system development. The key idea is to model the acoustic environment in speech audio as a factor of data variability and to incorporate it as a condition in neural network based speech synthesis. Two embedding extractors are trained on two purposely constructed datasets to characterize and disentangle the speaker and environment factors in speech. A neural network model is trained to generate speech from the extracted speaker and environment embeddings. Objective and subjective evaluation results demonstrate that the proposed TTS system effectively disentangles speaker and environment factors and synthesizes speech audio that carries the designated speaker characteristics and environment attributes. Audio samples are available online for demonstration: https://daxintan-cuhk.github.io/Environment-Aware-TTS/
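As an illustration of what conditioning synthesis on separate speaker and environment embeddings can look like, the following minimal sketch appends both embeddings to every frame of a text representation. The abstract does not state how the embeddings are injected into the synthesis network, so the concatenation scheme, the function name `condition_frames`, and all dimensions and values here are assumptions for illustration only, not the authors' method.

```python
from typing import List

Vector = List[float]


def condition_frames(
    text_states: List[Vector], speaker_emb: Vector, env_emb: Vector
) -> List[Vector]:
    """Append the same speaker and environment embeddings to every text frame.

    Assumption: conditioning by concatenation. The actual system may inject
    the embeddings differently.
    """
    return [frame + speaker_emb + env_emb for frame in text_states]


# Hypothetical toy values; real embeddings would come from trained extractors.
text_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 frames, 2 dims each
speaker_a = [1.0, 0.0, 0.0]   # placeholder speaker embedding
env_clean = [0.0, 1.0]        # placeholder "clean" environment embedding
env_reverb = [1.0, 0.0]       # placeholder "reverberant" environment embedding

# Same text and speaker, two different target environments.
clean_inputs = condition_frames(text_states, speaker_a, env_clean)
reverb_inputs = condition_frames(text_states, speaker_a, env_reverb)
```

Because the two factors occupy separate slots of the conditioning vector, swapping the environment embedding leaves the text and speaker portions untouched. This mirrors, in a very simplified form, the goal of synthesizing a designated speaker in a designated environment.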