SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization
This work addresses speaker privacy in voice data, but the contribution is incremental: it extends existing k-means quantization methods with a multi-model approach.
The paper proposes SEF-MK, a speaker-embedding-free voice anonymization framework that anonymizes SSL representations with multiple k-means models instead of one. Compared to a single k-means model, this better preserves linguistic and emotional content for the user, but an attacker who also uses multiple k-means models can mount more effective privacy attacks.
Voice anonymization protects speaker privacy by concealing identity while preserving linguistic and paralinguistic content. Self-supervised learning (SSL) representations encode linguistic features but preserve speaker traits. We propose a novel speaker-embedding-free framework called SEF-MK. Instead of using a single k-means model trained on the entire dataset, SEF-MK anonymizes SSL representations for each utterance by randomly selecting one of multiple k-means models, each trained on a different subset of speakers. We explore this approach from both attacker and user perspectives. Extensive experiments show that, compared to a single k-means model, SEF-MK with multiple k-means models better preserves linguistic and emotional content from the user's viewpoint. However, from the attacker's perspective, utilizing multiple k-means models boosts the effectiveness of privacy attacks. These insights can aid users in designing voice anonymization systems to mitigate attacker threats.
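The per-utterance model selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that SSL features are plain lists of floats, that the speaker subsets are disjoint, and that anonymization means replacing each frame with its nearest centroid. The function names (`fit_kmeans`, `train_multi_kmeans`, `anonymize_utterance`) and all parameter values are hypothetical, and the toy Gaussian vectors stand in for real SSL representations.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(frame, centroids):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))

def fit_kmeans(frames, k, iters, rng):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            clusters[nearest(f, centroids)].append(f)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def train_multi_kmeans(frames_by_speaker, n_models, k, iters=10, seed=0):
    """Train one k-means model per speaker subset (disjoint subsets: an assumption)."""
    rng = random.Random(seed)
    speakers = sorted(frames_by_speaker)
    rng.shuffle(speakers)
    subsets = [speakers[i::n_models] for i in range(n_models)]
    models = []
    for subset in subsets:
        frames = [f for s in subset for f in frames_by_speaker[s]]
        models.append(fit_kmeans(frames, k, iters, rng))
    return models

def anonymize_utterance(utt_frames, models, rng):
    """Pick one model at random for the whole utterance and quantize every frame."""
    centroids = rng.choice(models)
    return [list(centroids[nearest(f, centroids)]) for f in utt_frames]

# Toy usage: 6 synthetic "speakers", 20 frames each, 4-dimensional features.
demo_rng = random.Random(1)
toy_data = {
    f"spk{i}": [[demo_rng.gauss(i, 1.0) for _ in range(4)] for _ in range(20)]
    for i in range(6)
}
toy_models = train_multi_kmeans(toy_data, n_models=3, k=4)
toy_output = anonymize_utterance(toy_data["spk0"][:5], toy_models, demo_rng)
print(len(toy_models), len(toy_output))
```

In this sketch the model is drawn once per utterance, so all frames of an utterance share one codebook. A single-k-means baseline corresponds to `n_models=1`.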