HLTCOE JHU Submission to the Voice Privacy Challenge 2024
This submission addresses voice privacy for users under semi-white-box attack scenarios, but its contribution is incremental because it combines existing methods.
The paper tackles the trade-off between speaker anonymization and emotion preservation in voice privacy systems, finding that voice conversion better preserves emotion but struggles to conceal speaker identity, while TTS does the opposite. The proposed random admixture system achieves an EER of over 40% for anonymization and a UAR of 47% for emotion preservation.
We present a number of systems for the Voice Privacy Challenge, including voice-conversion-based systems such as the kNN-VC method and the WavLM voice conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
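The abstract does not specify how the random admixture is carried out, so the following is only a minimal sketch under one possible reading: each utterance is independently routed to either a voice conversion system or a TTS system. The function name `random_admixture`, the mixing probability `p_tts`, and the two stub anonymizers are hypothetical and stand in for the actual systems (kNN-VC, Whisper-VITS, and so on), which are not reproduced here.

```python
import random
from typing import Callable, Optional

# Hypothetical type: an anonymizer maps an input utterance to an output utterance.
# Utterances are represented as plain strings here purely for illustration.
Anonymizer = Callable[[str], str]


def random_admixture(
    utterance: str,
    vc_system: Anonymizer,
    tts_system: Anonymizer,
    p_tts: float = 0.5,  # assumed mixing probability, not taken from the paper
    rng: Optional[random.Random] = None,
) -> str:
    """Route one utterance to the TTS system with probability p_tts,
    otherwise to the voice conversion system (assumed per-utterance choice)."""
    if not 0.0 <= p_tts <= 1.0:
        raise ValueError("p_tts must lie in [0, 1]")
    rng = rng or random.Random()
    chosen = tts_system if rng.random() < p_tts else vc_system
    return chosen(utterance)


# Stub systems standing in for real VC and TTS anonymizers.
def stub_vc(utterance: str) -> str:
    return "vc:" + utterance


def stub_tts(utterance: str) -> str:
    return "tts:" + utterance


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs route utterances identically
    for utt in ["utt1", "utt2", "utt3", "utt4"]:
        print(random_admixture(utt, stub_vc, stub_tts, p_tts=0.5, rng=rng))
```

Under this reading, raising `p_tts` would shift the mix toward the stronger anonymization of TTS, while lowering it would favor the better emotion preservation of voice conversion; the value that yields the reported EER and UAR is not given in the abstract.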