SDAIAS | Apr 7, 2024

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

arXiv:2404.04904v2 | 29 citations | h-index: 11 | EMNLP
Originality: Incremental advance
AI Analysis

The paper addresses the problem of outdated datasets for audio deepfake detection, which matters for preventing the misuse of synthetic voices in security and privacy contexts. It represents an incremental improvement over prior work.

The paper tackles the problem of audio deepfake detection by constructing a new cross-domain dataset of over 300 hours of speech from advanced zero-shot TTS models, achieving equal error rates as low as 4.1% with Wav2Vec2-large and 6.5% with Whisper-medium through attack-augmented training.
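The summary above does not say which attacks are used for attack-augmented training. As a rough illustration of the general idea only, the sketch below randomly corrupts some training waveforms with additive white noise at a chosen signal-to-noise ratio while leaving the real/fake labels untouched. The noise attack, the function names `add_noise_at_snr` and `augment_batch`, and all parameter values are assumptions made for this example, not the paper's method.

```python
import numpy as np

def add_noise_at_snr(waveform, snr_db, rng=None):
    """Add white Gaussian noise so the result has roughly the requested SNR (in dB).

    Hypothetical stand-in for one "attack"; the source does not specify the attacks.
    """
    rng = np.random.default_rng() if rng is None else rng
    waveform = np.asarray(waveform, dtype=np.float64)
    signal_power = np.mean(waveform ** 2)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def augment_batch(batch, snr_choices_db=(10.0, 20.0, 30.0), attack_prob=0.5, rng=None):
    """Apply the noise attack to each waveform with probability attack_prob.

    Labels are not touched: an attacked fake is still fake, an attacked real is still real.
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = []
    for wav in batch:
        wav = np.asarray(wav, dtype=np.float64)
        if rng.random() < attack_prob:
            snr_db = float(rng.choice(snr_choices_db))
            wav = add_noise_at_snr(wav, snr_db, rng)
        augmented.append(wav)
    return augmented
```

In a real pipeline the augmented batch would be fed to the detector in place of the clean batch, so the model sees both clean and attacked versions of real and synthetic speech during training.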

Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
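The headline numbers in the abstract are equal error rates (EER): the operating point at which the rate of real speech flagged as fake equals the rate of fake speech that goes undetected. A minimal sketch of how an EER could be computed from detector scores follows. It assumes higher scores mean "more likely fake" and labels of 1 for fake and 0 for real; the function name `equal_error_rate` and the simple threshold sweep are choices made for this example, not the authors' evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a binary detector.

    Assumptions for this sketch: labels use 1 = fake, 0 = real,
    and a sample is flagged as fake when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake_scores = scores[labels == 1]
    real_scores = scores[labels == 0]

    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "flag nothing" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False acceptance here means real speech flagged as fake;
    # false rejection means fake speech that slips through.
    far = np.array([(real_scores >= t).mean() for t in thresholds])
    frr = np.array([(fake_scores < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]
```

With a finite test set the two error curves rarely cross exactly, so this sketch averages them at the closest point. Other implementations interpolate along the ROC curve instead and may give slightly different values.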
