Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
This work addresses the limited emotional expressiveness of TTS systems for users who need more nuanced and controllable speech synthesis. It is an incremental improvement that integrates continuous emotional dimensions into an existing framework.
The paper tackles the difficulty emotional text-to-speech systems have in capturing the full spectrum of human emotions. It proposes a language model-based TTS framework that synthesizes speech across a broad range of emotional styles, with user control along the continuous pleasure, arousal, and dominance dimensions. The results show that the framework generates more expressive emotional styles and improves both naturalness and diversity compared to baselines.
Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions: pleasure, arousal, and dominance (PAD). To support this control, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more expressive emotional styles and enhances both naturalness and diversity compared to baselines.
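
To make the idea of a continuous PAD control space concrete, the following is a minimal sketch and not the authors' implementation. It swaps the trained emotional dimension predictor for a fixed lookup table, and every name in it (`PAD`, `ANCHORS`, `interpolate`, `nearest_label`) is hypothetical. The anchor coordinates are illustrative placeholders, assumed to lie in [-1, 1]; they are not taken from the paper or from any psychological study.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space (assumed range [-1, 1])."""
    pleasure: float
    arousal: float
    dominance: float

    def clipped(self, lo: float = -1.0, hi: float = 1.0) -> "PAD":
        # Keep user-supplied controls inside the assumed valid range.
        def clip(v: float) -> float:
            return max(lo, min(hi, v))
        return PAD(clip(self.pleasure), clip(self.arousal), clip(self.dominance))


# Placeholder anchors for illustration only; NOT values from the paper.
ANCHORS = {
    "neutral": PAD(0.0, 0.0, 0.0),
    "happy": PAD(0.8, 0.5, 0.4),
    "sad": PAD(-0.6, -0.4, -0.3),
    "angry": PAD(-0.5, 0.6, 0.3),
}


def interpolate(a: PAD, b: PAD, t: float) -> PAD:
    """Linearly blend two PAD points; t=0 gives a, t=1 gives b."""
    return PAD(
        a.pleasure + t * (b.pleasure - a.pleasure),
        a.arousal + t * (b.arousal - a.arousal),
        a.dominance + t * (b.dominance - a.dominance),
    )


def nearest_label(p: PAD) -> str:
    """Return the categorical label whose anchor is closest in Euclidean distance."""
    def dist(q: PAD) -> float:
        return math.sqrt(
            (p.pleasure - q.pleasure) ** 2
            + (p.arousal - q.arousal) ** 2
            + (p.dominance - q.dominance) ** 2
        )
    return min(ANCHORS, key=lambda name: dist(ANCHORS[name]))


if __name__ == "__main__":
    # A style halfway between neutral and happy, which has no categorical label.
    mild_joy = interpolate(ANCHORS["neutral"], ANCHORS["happy"], 0.5)
    print(mild_joy, nearest_label(mild_joy))
```

The point of the sketch is that a continuous space admits intermediate styles, such as the blend above, that a fixed set of categorical labels cannot name directly. In the framework described here, a trained predictor, not a table, supplies the mapping into that space.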