CL · Apr 4, 2022

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

arXiv:2204.01677v1 · 2 citations · h-index: 33
Originality: Synthesis-oriented
AI Analysis

This work addresses privacy concerns in speech data collection for machine learning, but it is incremental, since it applies existing self-supervised methods to voice anonymization.

The study tackles privacy protection in speech data by using voice conversion models built on self-supervised speech representations to anonymize voices. Converted voices keep the word error rate within 1% of the original, while the equal error rate for speaker verification rises from 1.52% to 46.24% on LibriSpeech and from 3.75% to 45.84% on VCTK.

Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, privacy protection is an increasing concern that must be addressed. The current study investigates the use of voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations, including wav2vec 2.0, HuBERT and UniSpeech. Converted voices retain a low word error rate, within 1% of the original voice. The equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus, which signifies degraded performance on speaker verification. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices for discriminating between healthy and pathological speech.
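
The equal error rate (EER) quoted above is the operating point at which a speaker verification system's false acceptance rate equals its false rejection rate. An EER near 50% means the verifier does no better than chance, which is why the rise toward 46% indicates effective anonymization. The following is a minimal sketch of how an EER can be approximated from trial scores. It is not the authors' evaluation code: the function name `equal_error_rate` and the score lists are hypothetical, and the threshold sweep is a simple approximation that assumes higher scores mean "same speaker".

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    genuine_scores: similarity scores for same-speaker trials (hypothetical data).
    impostor_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Well-separated scores: the verifier can tell speakers apart.
separable = equal_error_rate([0.9, 0.8, 0.95], [0.1, 0.2, 0.3])

# Identical score distributions: the verifier is at chance level.
overlapping = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])

print(f"separable EER:   {separable:.2%}")
print(f"overlapping EER: {overlapping:.2%}")
```

In this toy setup the separable case corresponds to unmodified voices, where verification works, and the overlapping case corresponds to an idealized anonymizer whose output leaves genuine and impostor scores indistinguishable.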

