Diffusion Reconstruction towards Generalizable Audio Deepfake Detection
For researchers in audio forensics, this work provides a framework for improving the generalization of Audio Deepfake Detection (ADD) systems against evolving generative models.
The paper tackles the challenge of robust generalization in Audio Deepfake Detection (ADD) against unseen attacks. By using diffusion-based reconstruction to generate hard samples and applying multi-layer feature aggregation with Regularization-Assisted Contrastive Learning, the method achieves a significant reduction in average Equal Error Rate (EER) compared to the baseline.
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing hard samples is inherently equipped to handle simpler cases. We investigate multiple reconstruction paradigms and identify the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
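Since results are reported as Equal Error Rate, it may help to recall how that metric is usually computed: the EER is the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of the standard definition, not the paper's evaluation code. The function name `compute_eer`, the convention that a higher score means "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bona_fide_scores, spoof_scores):
    """Approximate the Equal Error Rate by sweeping observed score thresholds.

    Assumption: a higher score means the detector considers the utterance
    more likely to be bona fide. Returns (eer, threshold).
    """
    # Candidate thresholds are the distinct scores seen in either class.
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
bona_fide = [0.9, 0.8, 0.7, 0.3]
spoofed = [0.1, 0.2, 0.4, 0.6]
eer, threshold = compute_eer(bona_fide, spoofed)
print(f"EER = {eer:.2f} at threshold {threshold}")  # EER = 0.25 at threshold 0.6
```

An average EER, as referred to above, would then be the mean of this value over several evaluation sets. With few trials the two error curves may not cross exactly, which is why the sketch averages the two rates at the closest point; evaluation toolkits typically interpolate instead.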