The MSXF TTS System for ICASSP 2022 ADD Challenge
This work addresses the generation of synthetic speech that can fool audio deepfake detectors, a problem relevant to security applications, but it is incremental: it builds on existing methods and focuses on a specific competition task.
The authors tackled the problem of generating synthetic speech to fool audio deepfake detectors in the ICASSP 2022 ADD Challenge, achieving fourth place by using a VITS-based TTS system with a constraint loss and analyzing how speech speed and volume affect spoofing ability.
This paper presents our MSXF TTS system for Task 3.1 of the Audio Deep Synthesis Detection (ADD) Challenge 2022. We use an end-to-end text-to-speech system and add a constraint loss to it during training. The end-to-end TTS system is VITS, and the pre-trained self-supervised model is wav2vec 2.0. We also explore how speech speed and volume influence spoofing. Faster speech leaves less silence in the audio, which makes the detector easier to fool. We also find that lower volume improves spoofing ability, although we normalize the volume for submission. Our team ID is C2, and we placed fourth in the challenge.
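The two signal properties discussed in the abstract, the amount of silence and the volume, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `silence_fraction` and `peak_normalize`, the 20 ms frame length, the -40 dBFS threshold, and the 0.95 target peak are all assumptions chosen for illustration.

```python
import numpy as np


def silence_fraction(wave, sample_rate=16000, frame_ms=20, threshold_db=-40.0):
    """Return the fraction of frames whose RMS level is below threshold_db (dBFS).

    Hypothetical helper: frame length and threshold are illustrative assumptions.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return 0.0
    # Drop the trailing partial frame and split into non-overlapping frames.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Floor the RMS to avoid log10(0) on digitally silent frames.
    rms_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    return float(np.mean(rms_db < threshold_db))


def peak_normalize(wave, target_peak=0.95):
    """Scale the waveform so its maximum absolute sample equals target_peak.

    Hypothetical helper: one simple way to normalize volume, assumed here.
    """
    peak = np.max(np.abs(wave))
    if peak == 0:
        return wave.copy()
    return wave * (target_peak / peak)


# Toy signal: one second of a 220 Hz tone followed by one second of silence.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
utterance = np.concatenate([tone, np.zeros(sr)])

print("silence fraction:", silence_fraction(utterance, sr))
print("peak after normalization:", np.max(np.abs(peak_normalize(utterance))))
```

Under these assumptions, a faster rendition of the same text would typically yield a smaller silence fraction, and peak normalization removes any volume difference between two otherwise identical waveforms.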