AS | LG | Sep 13, 2024

HLTCOE JHU Submission to the Voice Privacy Challenge 2024

arXiv:2409.08913v2 | 20 citations | h-index: 63
Originality: Synthesis-oriented
AI Analysis

This paper addresses voice privacy for users under semi-white-box attack scenarios, but its contribution is incremental because it combines existing methods.

The paper tackles the trade-off between speaker anonymization and emotion preservation in voice privacy systems. It finds that voice conversion systems preserve emotion better but struggle to conceal speaker identity in semi-white-box attack scenarios, while TTS systems anonymize better but preserve emotion worse. The proposed random admixture system achieves an EER of over 40% for anonymization while maintaining a UAR of 47% for emotion preservation.

We present a number of systems for the Voice Privacy Challenge, including voice-conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance out the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
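The two metrics quoted in the abstract can be illustrated with a minimal sketch. EER (equal error rate) is the operating point at which a speaker verification attacker's false acceptance rate equals its false rejection rate, so in the privacy setting a higher EER means the attacker is closer to chance (50%). UAR (unweighted average recall) is the mean of per-class recalls for the emotion recognizer. The function names and the simple threshold sweep below are illustrative assumptions, not the challenge's official scoring code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    target_scores: attacker scores for same-speaker trials.
    nontarget_scores: attacker scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: different-speaker trials that pass the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: same-speaker trials that fall below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of per-class recalls, so every emotion class counts equally."""
    recalls = []
    for c in sorted(set(true_labels)):
        idx = [i for i, y in enumerate(true_labels) if y == c]
        correct = sum(predicted_labels[i] == c for i in idx)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Toy scores (hypothetical): identical distributions leave the attacker at chance.
    print(equal_error_rate([0.1, 0.9], [0.1, 0.9]))
    # Toy labels (hypothetical): always predicting the majority class is penalized.
    print(unweighted_average_recall(["a", "a", "a", "b"], ["a", "a", "a", "a"]))
```

Because UAR averages recall per class rather than per utterance, a recognizer that always predicts the majority emotion scores poorly even when its plain accuracy is high.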

