AS SD · May 21, 2020

Pitchtron: Towards audiobook generation from ordinary people's voices

arXiv:2005.10456v13.35 citations · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses audiobook generation from ordinary speakers' voices by improving prosody transfer, though the contribution is incremental and domain-specific.

The paper tackles prosody transfer for audiobook generation, where the training data comes from ordinary speakers and the reference audio comes from professionals. It proposes two models, hard and soft pitchtron, to address the glitches in pitch, energy, and pause length that the baseline global style token (GST) method produces in this setting. The reported AXY scores over GST are 2.01 for hard pitchtron and 1.14 for soft pitchtron, so both models outperform the baseline.

In this paper, we explore prosody transfer for audiobook generation under a fairly realistic condition: the training database is plain audio, mostly from multiple ordinary speakers, while the reference audio given at inference comes from professional speakers and is richer in prosody than the training data. Specifically, we explore transferring Korean dialects and emotive speech even though the training set consists mostly of standard, neutral Korean. We found that in this setting the original global style token (GST) method generates undesirable glitches in pitch, energy, and pause length. To address this issue, we propose two models, hard and soft pitchtron, and release the toolkit and corpus that we have developed. Hard pitchtron uses pitch as input to the decoder, while soft pitchtron uses pitch as input to the prosody encoder. We verify the effectiveness of the proposed models with objective and subjective tests. The AXY score over GST is 2.01 for hard pitchtron and 1.14 for soft pitchtron.
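The difference between the two variants is where the pitch contour enters the model. The toy sketch below illustrates only that difference and is not the released Pitchtron toolkit: the function names, the mean-pooling "prosody encoder", and the assumption that the hard variant receives a pitch contour aligned frame by frame with the decoder steps are all hypothetical simplifications.

```python
# Illustrative sketch only: toy stand-ins, not the Pitchtron implementation.
from typing import List

Frame = List[float]


def prosody_encoder(frames: List[Frame]) -> Frame:
    """Toy prosody encoder (assumption): mean-pool frames into one embedding."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def hard_conditioning(text_ctx: List[Frame], pitch: List[float]) -> List[Frame]:
    """Hard variant: each decoder step sees the pitch value of its own frame.

    Assumes the pitch contour is already aligned with the decoder steps.
    """
    if len(text_ctx) != len(pitch):
        raise ValueError("pitch contour must be aligned with decoder steps")
    return [ctx + [p] for ctx, p in zip(text_ctx, pitch)]


def soft_conditioning(
    text_ctx: List[Frame], ref_mel: List[Frame], ref_pitch: List[float]
) -> List[Frame]:
    """Soft variant: pitch joins the reference features before the prosody
    encoder, so the decoder only sees the pooled prosody embedding."""
    if len(ref_mel) != len(ref_pitch):
        raise ValueError("reference pitch must match reference frames")
    embedding = prosody_encoder([m + [p] for m, p in zip(ref_mel, ref_pitch)])
    return [ctx + embedding for ctx in text_ctx]


if __name__ == "__main__":
    text_ctx = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three decoder steps
    hard = hard_conditioning(text_ctx, [100.0, 110.0, 120.0])
    soft = soft_conditioning(text_ctx, [[1.0, 2.0], [3.0, 4.0]], [100.0, 200.0])
    print(hard[1])
    print(soft[0])
```

In this sketch the hard path carries frame-level pitch straight to the decoder, while the soft path compresses pitch together with the reference features into a single embedding, so the reference does not need to match the decoder length.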
