ASSD Apr 30, 2020

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

arXiv:2004.14617v1
73 citations
AI Analysis

This work addresses a key limitation in speech synthesis for applications that require natural, identity-preserving voice conversion, though it is an incremental advance over existing methods.

The paper tackles the problem of source speaker leakage in fine-grained prosody transfer for neural text-to-speech, proposing CopyCat, which achieves a 47% relative improvement in prosody transfer quality and 14% in preserving target speaker identity compared to a state-of-the-art model.

Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of $47\%$ in the quality of prosody transfer and $14\%$ in preserving the target speaker identity, while still maintaining the same naturalness.

