Self Voice Conversion as an Attack against Neural Audio Watermarking
This work addresses a security problem for audio watermarking systems by revealing a significant threat from deep learning-based attacks. The contribution is incremental in that it builds on existing robustness assessments.
The paper examines the vulnerability of neural audio watermarking to a novel attack based on self voice conversion, which remaps a speaker's voice to the same identity while altering acoustic characteristics. It demonstrates that this attack severely degrades the reliability of state-of-the-art watermarking approaches.
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.
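The kind of robustness check described above can be sketched as a small harness that compares the payload bit error rate before and after an attack. This is a minimal illustration, not the paper's evaluation code. The names `evaluate_attack`, `toy_embed`, `toy_detect`, and `toy_attack` are all hypothetical. The toy watermark (a per-segment offset) and the toy attack (a polarity flip) are trivial stand-ins chosen so the example runs without any model. Neither resembles the neural watermarking systems or the voice conversion model studied in the paper; in a real experiment those would be passed in as the `embed`, `detect`, and `attack` callables.

```python
import math
from typing import Callable, List, Sequence, Tuple

Signal = List[float]
Bits = List[int]


def bit_error_rate(sent: Sequence[int], received: Sequence[int]) -> float:
    """Fraction of payload bits that were decoded incorrectly."""
    if len(sent) != len(received) or not sent:
        raise ValueError("payloads must be non-empty and of equal length")
    return sum(s != r for s, r in zip(sent, received)) / len(sent)


def evaluate_attack(
    embed: Callable[[Signal, Bits], Signal],
    detect: Callable[[Signal, int], Bits],
    attack: Callable[[Signal], Signal],
    audio: Signal,
    payload: Bits,
) -> Tuple[float, float]:
    """Return (BER without attack, BER after attack) for one utterance."""
    watermarked = embed(audio, payload)
    clean = bit_error_rate(payload, detect(watermarked, len(payload)))
    attacked = bit_error_rate(payload, detect(attack(watermarked), len(payload)))
    return clean, attacked


# --- Hypothetical toy stand-ins (assumptions, not the systems in the paper) ---

def toy_embed(audio: Signal, payload: Bits, delta: float = 0.01) -> Signal:
    """Shift each segment up (bit 1) or down (bit 0) by a small offset."""
    seg = len(audio) // len(payload)
    out = list(audio)
    for i, bit in enumerate(payload):
        offset = delta if bit else -delta
        for k in range(i * seg, (i + 1) * seg):
            out[k] += offset
    return out


def toy_detect(audio: Signal, n_bits: int) -> Bits:
    """Decode each bit from the sign of the segment sum."""
    seg = len(audio) // n_bits
    return [1 if sum(audio[i * seg:(i + 1) * seg]) > 0 else 0 for i in range(n_bits)]


def toy_attack(audio: Signal) -> Signal:
    """Polarity inversion: a placeholder for a content-preserving attack."""
    return [-x for x in audio]


# Host signal: a sine with a whole number of periods per 100-sample segment,
# so each segment sums to (numerically) zero before embedding.
audio = [0.5 * math.sin(2 * math.pi * k / 50) for k in range(1600)]
payload = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0]

ber_clean, ber_attacked = evaluate_attack(
    toy_embed, toy_detect, toy_attack, audio, payload
)
print(ber_clean, ber_attacked)
```

Keeping the embedder, detector, and attack as interchangeable callables means the same harness can compare conventional distortions (compression, additive noise, resampling) against a self voice conversion attack on equal terms, provided each is wrapped as a function from signal to signal.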