SD · AI · Mar 8

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

arXiv:2603.07551v1
Predicted impact: top 52% in SD (last 90 days) · Originality: highly original
AI Analysis

This work addresses a critical privacy concern for individuals whose voices could be cloned by zero-shot TTS models, proposing a method to prevent the generation of specific identities.

The authors introduce Speech Generation Speaker Poisoning (SGSP), a task in which a trained zero-shot Text-to-Speech (TTS) model is modified so that it can no longer generate specific speaker identities. Such models pose privacy risks because they can reconstruct a voice from a reference prompt alone. The authors evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers, achieving strong privacy for up to 15 speakers but hitting scalability limits at 100 speakers due to increased identity overlap.
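To make the inference-time filtering idea concrete, here is a minimal sketch, not the authors' implementation. It assumes that each reference prompt has already been mapped to a speaker embedding by some speaker-verification model, and that a request is refused when the prompt is too similar to any forgotten speaker. The function names, the `synthesize` callback, and the `0.7` threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_forgotten_speaker(prompt_embedding, forget_embeddings, threshold=0.7):
    """Return True if the prompt is close to any forgotten speaker.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return any(
        cosine_similarity(prompt_embedding, forgotten) >= threshold
        for forgotten in forget_embeddings
    )


def filtered_synthesize(text, prompt_embedding, forget_embeddings,
                        synthesize, threshold=0.7):
    """Call the (hypothetical) TTS callback only for non-forgotten speakers.

    Returns None when the request is blocked.
    """
    if is_forgotten_speaker(prompt_embedding, forget_embeddings, threshold):
        return None
    return synthesize(text, prompt_embedding)
```

A filter of this kind leaves the model weights untouched, so it contrasts with the parameter-modification baselines, which change the model itself.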

Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
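The three metrics named in the abstract can be sketched as follows. The abstract does not give their exact definitions, so this sketch rests on assumptions: WER is the standard word-level edit distance divided by the reference length, AUC is the usual ROC area over two sets of similarity scores (which two sets the paper compares is not stated here), and FSSIM is taken to be the mean cosine similarity between embeddings of generated speech and embeddings of the forgotten speaker. All function names are hypothetical.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def roc_auc(positive_scores, negative_scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def forget_speaker_similarity(generated_embeddings, reference_embeddings):
    """Assumed FSSIM: mean cosine similarity over paired embeddings."""
    pairs = list(zip(generated_embeddings, reference_embeddings))
    return sum(cosine_similarity(g, r) for g, r in pairs) / len(pairs)
```

Under these assumptions, a lower FSSIM on the forgotten speakers means the generated audio sounds less like them, while WER on the remaining speakers tracks how much utility was lost.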

