DiffAnon: Diffusion-based Prosody Control for Voice Anonymization
For voice anonymization researchers and practitioners, DiffAnon offers the first structured, interpolatable inference-time control over the utility-privacy trade-off, addressing a key limitation of existing methods.
DiffAnon introduces a diffusion-based voice anonymization method that provides explicit, continuous control over prosody preservation at inference time, achieving strong utility with competitive privacy across controllable operating points.
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or operate at fixed design points, lacking a principled mechanism to control the utility-privacy trade-off. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of a residual vector quantization (RVQ) codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments show structured trade-off behavior: DiffAnon achieves strong utility while maintaining competitive privacy across controllable operating points.
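As a rough illustration of how a guidance scale can act as a continuous inference-time knob, the sketch below applies the standard classifier-free guidance combination to two denoiser outputs. It is not DiffAnon's implementation: the function name `cfg_combine`, the toy vectors, and the assumption that the conditional branch is the prosody-conditioned one are all hypothetical, chosen only to show the interpolation behavior.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combination of two predictions.

    Computes uncond + scale * (cond - uncond) element-wise.
    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and values in between interpolate linearly.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for denoiser outputs (hypothetical values, not real model output).
# Assumption: `cond_pred` comes from a prosody-conditioned pass and
# `uncond_pred` from a pass with the prosody condition dropped.
uncond_pred = [0.0, 2.0]
cond_pred = [1.0, 4.0]

no_guidance = cfg_combine(uncond_pred, cond_pred, 0.0)
halfway = cfg_combine(uncond_pred, cond_pred, 0.5)
full_condition = cfg_combine(uncond_pred, cond_pred, 1.0)
extrapolated = cfg_combine(uncond_pred, cond_pred, 2.0)
```

Under these assumptions, sweeping `scale` at inference time moves a single trained model between operating points without retraining; how DiffAnon maps its scale to prosody preservation is not specified in this abstract.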