SDAS · Nov 17, 2020

Optimizing voice conversion network with cycle consistency loss of speaker identity

arXiv:2011.08548v1 · 21 citations
Originality: Incremental advance
AI Analysis

This work addresses the problem of keeping converted speech faithful to the reference speaker's identity, which matters for the naturalness and quality of voice conversion output. It is an incremental improvement on existing voice conversion methods.

This paper proposes a training scheme for voice conversion networks that minimizes both frame-level spectral loss and speaker identity loss. It introduces a cycle consistency loss that constrains the converted speech to keep the speaker identity of the reference speech at the utterance level, which improves speaker similarity over baseline methods on the CMU-ARCTIC and CSTR-VCTK corpora.

We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also the speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at the utterance level. While the proposed training scheme is applicable to any voice conversion network, we formulate the study under the average model voice conversion framework in this paper. Experiments conducted on the CMU-ARCTIC and CSTR-VCTK corpora confirm that the proposed method outperforms baseline methods in terms of speaker similarity.
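
The combined objective described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the distance measures, the speaker encoder, or the loss weighting, so the mean-squared spectral error, the mean-pooling `utterance_embedding` stand-in for a speaker encoder, the cosine-based identity term, and the `weight` parameter are all assumptions chosen for illustration.

```python
import math

def spectral_loss(converted, target):
    """Frame-level loss: mean squared error over all frames and feature bins
    (assumed distance; the abstract does not name one)."""
    total, count = 0.0, 0
    for conv_frame, tgt_frame in zip(converted, target):
        for c, t in zip(conv_frame, tgt_frame):
            total += (c - t) ** 2
            count += 1
    return total / count

def utterance_embedding(frames):
    """Hypothetical stand-in for a speaker encoder:
    mean-pool the frames, then L2-normalise the result."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean)) or 1.0  # avoid division by zero
    return [m / norm for m in mean]

def speaker_identity_loss(converted, reference):
    """Utterance-level loss: 1 - cosine similarity between the embedding of the
    converted speech and that of the reference speech (assumed distance)."""
    e_conv = utterance_embedding(converted)
    e_ref = utterance_embedding(reference)
    return 1.0 - sum(a * b for a, b in zip(e_conv, e_ref))

def total_loss(converted, target, reference, weight=1.0):
    """Sum of the frame-level and utterance-level terms.
    `weight` is a hypothetical balancing factor."""
    return spectral_loss(converted, target) + weight * speaker_identity_loss(
        converted, reference
    )
```

In a real system the converted features would be passed back through a trained speaker encoder, which is what makes the identity term a cycle: the embedding that conditions the conversion is compared with the embedding recovered from the output.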

