SD Aug 31, 2023
Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
Weiqin Li, Shun Lei, Qiaochu Huang et al.
The spontaneous behavior that often occurs in conversations makes speech more human-like than reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavior labels. In the process of semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between each sentence in the conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict reasonable spontaneous behavior from text.
SD Jul 18, 2024
Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models
Weiqin Li, Peiji Yang, Yicheng Zhong et al.
Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.
SD Apr 14, 2025
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
Dan Luo, Chengyuan Ma, Weiqin Li et al.
With the advancement of speech synthesis technology, users have higher expectations for the naturalness and expressiveness of synthesized speech. However, previous research has overlooked the importance of prompt selection. This study proposes a text-to-speech (TTS) framework based on Retrieval-Augmented Generation (RAG) technology, which can dynamically adjust the speech style according to the text content to achieve more natural and vivid communication effects. We have constructed a speech style knowledge database containing high-quality speech samples in various contexts and developed a style matching scheme. This scheme uses embeddings, extracted by Llama, PER-LLM-Embedder, and Moka, to match with samples in the knowledge database, selecting the most appropriate speech style for synthesis. Furthermore, our empirical research validates the effectiveness of the proposed method. Our demo can be viewed at: https://thuhcsi.github.io/icme2025-AutoStyle-TTS
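The style matching step described in the abstract above can be illustrated with a minimal sketch. It assumes that matching means ranking knowledge-database samples by cosine similarity to a text embedding; the abstract does not state the similarity measure, so this is an assumption. The names `retrieve_style_prompt`, `knowledge_base`, and the sample IDs are hypothetical, and the three-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_style_prompt(query_embedding, knowledge_base, top_k=1):
    """Return the top_k (score, sample_id) pairs, best match first.

    Hypothetical helper: ranks every stored sample by cosine similarity
    to the query embedding. This is an assumed matching rule, not the
    paper's confirmed scheme.
    """
    scored = [
        (cosine_similarity(query_embedding, entry["embedding"]), entry["sample_id"])
        for entry in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy knowledge database: each entry pairs a speech sample ID with an embedding.
knowledge_base = [
    {"sample_id": "calm_narration", "embedding": [0.9, 0.1, 0.0]},
    {"sample_id": "excited_dialogue", "embedding": [0.1, 0.9, 0.2]},
    {"sample_id": "sad_monologue", "embedding": [0.0, 0.2, 0.9]},
]

# Toy embedding of the text to be synthesized.
query = [0.2, 0.8, 0.1]

best = retrieve_style_prompt(query, knowledge_base, top_k=1)
print(best[0][1])  # ID of the sample whose style would be used as the prompt
```

In a real system the embeddings would come from the text encoders and the selected sample would be passed to the TTS model as the style prompt; both steps are outside this sketch.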
IV Jul 3, 2019
Region-Manipulated Fusion Networks for Pancreatitis Recognition
Jian Wang, Xiaoyao Li, Xiangbo Shu et al.
This work is a first attempt to automatically recognize pancreatitis on CT scan images. However, unlike traditional object recognition, pancreatitis recognition is challenging due to the fine-grained and non-rigid appearance variability of the local diseased regions. To this end, we propose customized Region-Manipulated Fusion Networks (RMFN) to capture the key characteristics of local lesions for pancreatitis recognition. Specifically, to effectively highlight the imperceptible lesion regions, a novel region-manipulated scheme in RMFN is proposed to strengthen the lesion regions while weakening the non-lesion regions by repeatedly aggregating the multi-scale local information onto feature maps. The proposed scheme can be flexibly integrated into existing neural networks, such as AlexNet and VGG. To evaluate the performance of the proposed method, a real CT image database of pancreatitis is collected from hospitals (the database will be made available later). Experimental results on this database demonstrate the effectiveness of the proposed method for pancreatitis recognition.