SD | LG | AS | Jul 10, 2022

A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

arXiv:2207.04356v1 | 24 citations | h-index: 55 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses voice conversion, but its contribution is incremental: it offers a comparative analysis rather than a new method.

The study compares self-supervised speech representations as alternatives to expensive supervised representations for voice conversion. It finds that these representations are competitive with state-of-the-art systems and that post-discretization improves results in the any-to-any setting.

We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC toolkit we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the Voice Conversion Challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision. We also studied the effect of a post-discretization process with k-means clustering and showed that it improves performance in the A2A setting. Finally, the comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and also sheds light on possible directions for improvement.
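The post-discretization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the S3PRL-VC implementation: the abstract does not give the cluster count, feature dimension, or training details, so the sizes below are hypothetical, the features are synthetic random vectors standing in for S3R frames, and the helper names (`fit_kmeans`, `assign_clusters`, `discretize`) are invented for illustration. The idea shown is only the general one: fit k-means centroids on continuous frame features, then replace each frame with its nearest centroid.

```python
import numpy as np


def assign_clusters(frames, centroids):
    """Index of the nearest centroid (squared Euclidean distance) for each frame."""
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means over frame vectors; returns (num_clusters, dim) centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    init_idx = rng.choice(len(frames), size=num_clusters, replace=False)
    centroids = frames[init_idx].copy()
    for _ in range(num_iters):
        labels = assign_clusters(frames, centroids)
        for k in range(num_clusters):
            members = frames[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def discretize(frames, centroids):
    """Replace every frame with its nearest centroid; returns (quantized, labels)."""
    labels = assign_clusters(frames, centroids)
    return centroids[labels], labels


# Synthetic stand-in for S3R features: 200 frames of 16-dim vectors (hypothetical sizes).
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 16))
codebook = fit_kmeans(features, num_clusters=8)
quantized, unit_ids = discretize(features, codebook)
```

In a real pipeline, `features` would come from an S3R upstream model and `quantized` (or the integer `unit_ids`) would be passed to the synthesis model; how the paper wires this up is not stated in the abstract.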

Code Implementations: 1 repo