AS CL SD MLAug 2, 2018

Investigating accuracy of pitch-accent annotations in neural network-based speech synthesis and denoising effects

Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, Nobuyuki Nishizawa

arXiv:1808.00665v13.32 citations

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of noisy annotations in speech synthesis for Japanese language applications, but it is incremental as it builds on existing WaveNet-based methods.

The study examined how noisy linguistic features affect a Japanese neural speech synthesis system, finding that test-set noise significantly degrades performance due to training-test mismatch, but adding noise during training can partially mitigate this effect and improve robustness.

We investigated the impact of noisy linguistic features on the performance of a Japanese speech synthesis system based on neural network that uses WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features including phoneme and prosodic information in training and test sets against a few other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, affected the ideal system's performance significantly in a statistical sense due to a mismatched condition between the training and test sets. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model, and help the system perform better when linguistic features of the test set are noisy.

View on arXiv PDF

Similar