SD · CL · AS · Jun 16, 2022

Automatic Prosody Annotation with Pre-Trained Text-Speech Model

arXiv:2206.07956v1 · 11 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the slow and costly process of manual prosody annotation for TTS systems and offers an incremental improvement in automating it.

The paper tackles costly manual prosody annotation for text-to-speech synthesis by proposing an automatic method based on a pre-trained text-speech model. The resulting annotations are comparable in quality to human ones, and TTS systems trained on them perform slightly better.

Prosodic boundaries play an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.
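
To make the {speech, text, prosody} triplet format concrete, here is a minimal Python sketch of how one such training example might be represented. It is an illustration only, not the authors' data format: the field names, the integer boundary levels (0 meaning no boundary), the `#<level>` rendering, and the file name are all assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label inventory: 0 = no boundary, larger values = stronger break.
NO_BOUNDARY = 0


@dataclass
class ProsodyExample:
    """One {speech, text, prosody} triplet (field names are illustrative)."""

    audio_path: str        # speech: path to the utterance waveform
    tokens: List[str]      # text: units such as words or characters
    boundaries: List[int]  # prosody: boundary label following each token

    def __post_init__(self) -> None:
        # Every token needs exactly one boundary label.
        if len(self.tokens) != len(self.boundaries):
            raise ValueError("need exactly one boundary label per token")


def render(example: ProsodyExample) -> str:
    """Interleave tokens with '#<level>' markers wherever a boundary is labeled."""
    parts: List[str] = []
    for token, level in zip(example.tokens, example.boundaries):
        parts.append(token)
        if level != NO_BOUNDARY:
            parts.append(f"#{level}")
    return " ".join(parts)


example = ProsodyExample(
    audio_path="utt_0001.wav",  # placeholder file name
    tokens=["the", "model", "labels", "boundaries"],
    boundaries=[0, 1, 0, 2],
)
print(render(example))  # the model #1 labels boundaries #2
```

In this framing, an automatic annotator would predict the `boundaries` field from the audio and the tokens together, whereas a text-only baseline would see only the tokens.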

Code Implementations: 1 repo