Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis
This work addresses the problem of generating natural sarcastic speech for applications such as entertainment and human-computer interaction. Its contribution is incremental, since it builds on existing TTS and sarcasm detection methods.
The study tackles sarcastic speech synthesis by integrating feedback loss from a bi-modal sarcasm detector into TTS training and by using a two-stage fine-tuning process. Objective and subjective evaluations show improved quality, naturalness, and sarcasm-awareness in the synthesized speech.
Sarcastic speech synthesis, the generation of speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains challenging because of the nuanced prosody that characterizes sarcasm and the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the text-to-speech (TTS) training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. In the first stage, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined on a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that the proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
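The two ideas in the abstract, a detector-derived feedback term added to the TTS objective and a two-stage fine-tuning schedule, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feedback term is modeled as a binary cross-entropy on the detector's sarcasm probability, the combination is a weighted sum with a hypothetical `weight`, and the names `feedback_loss`, `total_loss`, `Stage`, and `SCHEDULE` are invented. The bi-modal detector itself is represented only by the probability it would output for a synthesized utterance.

```python
import math
from dataclasses import dataclass


def feedback_loss(p_sarcastic: float, target_is_sarcastic: bool, eps: float = 1e-7) -> float:
    """Binary cross-entropy between the detector's sarcasm probability
    and the intended label (assumed form of the feedback term)."""
    p = min(max(p_sarcastic, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if target_is_sarcastic else -math.log(1.0 - p)


def total_loss(tts_loss: float, p_sarcastic: float,
               target_is_sarcastic: bool, weight: float = 0.1) -> float:
    """TTS reconstruction loss plus the weighted feedback term.
    The weighted sum and the default weight are assumptions."""
    return tts_loss + weight * feedback_loss(p_sarcastic, target_is_sarcastic)


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str  # description only; no real dataset is named here


# Two-stage fine-tuning, starting from a model pre-trained on read speech.
SCHEDULE = [
    Stage("stage 1", "diverse speech styles, including sarcastic speech"),
    Stage("stage 2", "sarcastic speech only"),
]

if __name__ == "__main__":
    for stage in SCHEDULE:
        # In a real pipeline, each stage would run a training loop that
        # minimizes total_loss on batches drawn from stage.dataset.
        print(stage.name, "->", stage.dataset)
    # The feedback term is smaller when the detector agrees with the target.
    print(feedback_loss(0.9, True) < feedback_loss(0.1, True))
```

In this sketch the feedback term shrinks as the detector becomes more confident that the synthesized utterance matches the intended label, which is the sense in which the detector "feeds back" into TTS training. In a gradient-based implementation the detector would typically be kept differentiable with respect to the synthesized audio so that the term can influence the TTS parameters.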