SD ASApr 9

DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection

Yassine El Kheir, Arnab Das, Yixuan Xiao, Xin Wang, Feidi Kallel, Enes Erdem Erdogan, Ngoc Thang Vu, Tim Polzehl, Sebastian Moeller

arXiv:2604.0845050.4Has Code

AI Analysis

This work addresses the need for standardized tools in deepfake audio detection research, though it is incremental as it builds on existing methods without introducing new detection paradigms.

The authors tackled the problem of limited reproducibility and benchmarking in deepfake audio detection by developing DeepFense, a unified open-source toolkit, and found that pre-trained front-end feature extractors dominate performance variance while revealing severe biases in models related to audio quality, speaker gender, and language.

Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.

View on arXiv PDF Code

Similar