AS AI CL SD | Sep 11, 2023

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

arXiv:2309.05423v22.33 citations | h-index: 43 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, consistent prosody annotation in TTS systems, offering a robust solution that improves incrementally on existing methods.

The paper tackles the problem of labor-intensive manual prosody annotation in Text-to-Speech by proposing a two-stage automatic annotation pipeline, achieving state-of-the-art performance with F1 scores of 0.72 for Prosodic Word boundaries and 0.93 for Prosodic Phrase boundaries.
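To make the reported metric concrete, here is a minimal sketch of a per-class F1 score over token-level boundary labels. The label names (`"O"`, `"PW"`, `"PPH"`) and the strict position-by-position matching are assumptions made for illustration; this listing does not specify the paper's exact evaluation protocol.

```python
def boundary_f1(reference, predicted, label):
    """F1 for one boundary class, comparing labels position by position.

    `reference` and `predicted` are equal-length lists of label strings.
    The label inventory is hypothetical, not taken from the paper.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)
    fp = sum(1 for r, p in pairs if r != label and p == label)
    fn = sum(1 for r, p in pairs if r == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: one label per word, marking the boundary after that word.
reference = ["O", "PW", "O", "PPH", "PW", "O"]
predicted = ["O", "PW", "PW", "PPH", "O", "O"]
pw_f1 = boundary_f1(reference, predicted, "PW")
pph_f1 = boundary_f1(reference, predicted, "PPH")
```

In this toy case the Prosodic Word class has one hit, one false alarm, and one miss, so precision and recall are both 0.5; the single Prosodic Phrase boundary is predicted correctly.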

In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries, respectively, while remaining remarkably robust to data scarcity.
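The first-stage idea, pulling each speech-silence embedding toward the embedding of its matching word-punctuation pair, can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is a sketch under stated assumptions, not the paper's formulation: the abstract does not give the loss, the temperature, or the encoder outputs, so the function names, the `temperature` default, and the toy vectors below are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_logits(speech_embs, text_embs, temperature):
    """Cosine similarity of every speech/text pair, divided by temperature."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
            for si in s]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Average of speech-to-text and text-to-speech contrastive losses.

    Item i of `speech_embs` is assumed to be paired with item i of
    `text_embs`; every other item in the batch acts as a negative.
    """
    logits = similarity_logits(speech_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(transposed))


# Toy batch: two speech-silence embeddings and two word-punctuation embeddings.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(speech, text_aligned)
loss_swapped = symmetric_contrastive_loss(speech, text_swapped)
```

The loss is small when paired embeddings already point in the same direction and large when the pairing is swapped, which is the signal that drives the encoders toward a shared representation.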

