SD · AI · Mar 8

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

arXiv:2603.07551v19.1

Predicted impact: top 52% in SD · last 90 days

Originality: Highly original

AI Analysis

This work addresses a critical privacy concern for individuals whose voices could be cloned by zero-shot TTS models, proposing a method to prevent the generation of specific identities.

The authors formalize Speech Generation Speaker Poisoning (SGSP): modifying a trained zero-shot Text-to-Speech (TTS) model so that it cannot generate specific speaker identities while remaining useful for other speakers. Such models pose privacy risks because they can reconstruct a voice from a reference prompt alone. The authors evaluate inference-time filtering and parameter-modification baselines with 1, 15, and 100 forgotten speakers, achieving strong privacy for up to 15 speakers but hitting scalability limits at 100 speakers due to increased identity overlap.

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
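To make the evaluation setup concrete, the following is a minimal sketch of an inference-time filter and of the two generic metrics named above. It is not the authors' implementation: the abstract does not define how the filter works, how AUC is computed, or how FSSIM is measured. The cosine-similarity gate, the 0.7 threshold, and the helper names `cosine`, `is_blocked`, `wer`, and `auc` are all assumptions made for illustration, and speaker embeddings are stood in for by plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_blocked(prompt_emb, forget_embs, threshold=0.7):
    """Hypothetical inference-time filter: refuse synthesis when the
    reference prompt is too similar to any forgotten speaker.
    The threshold value is an assumption, not taken from the paper."""
    return any(cosine(prompt_emb, f) >= threshold for f in forget_embs)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a random
    positive outscores a random negative (ties count as one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

In a setup like this, a fixed similarity threshold becomes harder to choose as the forget set grows, since more retained speakers will lie close to at least one forgotten speaker. That is one plausible reading of the identity-overlap limit reported at 100 speakers, though the abstract does not spell out the mechanism.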

