CL · Apr 4, 2022

Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices

Abner Hernandez, Paula Andrea Pérez-Toro, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

arXiv:2204.01677v1 · 0.62 citations · h-index: 33

Originality: Synthesis-oriented

AI Analysis

This work addresses privacy concerns in speech data collection for machine learning, but its contribution is incremental: it applies existing self-supervised methods to voice anonymization.

The study addresses privacy protection in speech data by anonymizing voices with voice conversion models built on self-supervised speech representations. Converted voices keep a word error rate within 1% of the original, while the equal error rate for speaker verification rises from 1.52% to 46.24% on LibriSpeech and from 3.75% to 45.84% on VCTK.

Collecting speech data is an important step in training speech recognition systems and other speech-based machine learning models. However, privacy protection is an increasing concern that must be addressed. The current study investigates the use of voice conversion as a method for anonymizing voices. In particular, we train several voice conversion models using self-supervised speech representations including Wav2Vec2.0, HuBERT and UniSpeech. Converted voices retain a low word error rate within 1% of the original voice. Equal error rate increases from 1.52% to 46.24% on the LibriSpeech test set and from 3.75% to 45.84% on speakers from the VCTK corpus, which signifies degraded performance on speaker verification. Lastly, we conduct experiments on dysarthric speech data to show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices for discriminating between healthy and pathological speech.
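
To make the speaker verification numbers concrete: the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate, so an EER near 50% means the verifier does little better than chance at matching an anonymized voice to its speaker. The following is a minimal sketch of that computation, not the evaluation code used in the paper; the function name `equal_error_rate` and the toy similarity scores are assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER from verification similarity scores.

    genuine_scores: scores for same-speaker trials (higher = more similar).
    impostor_scores: scores for different-speaker trials.
    Both inputs are hypothetical; a real system would produce them by
    comparing speaker embeddings.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance: impostor trial scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Well-separated scores: the verifier tells speakers apart.
separable = equal_error_rate([0.9, 0.8, 0.95], [0.1, 0.2, 0.3])

# Identical score distributions: the verifier cannot tell speakers apart,
# which is the situation effective anonymization aims for.
indistinguishable = equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
```

In this toy setup the first case corresponds to a verifier that works on original voices, and the second to one that fails on converted voices. Production evaluations typically interpolate along the ROC curve rather than scanning observed thresholds.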
