A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Hashim Ali, Nithin Sai Adupa, Surya Subramani, Hafiz Malik

arXiv:2603.01482

Originality: Incremental advance

AI Analysis

This work addresses the security-critical problem of audio deepfake detection for speech systems by providing a reproducible benchmark, though it is incremental as it extends existing benchmarking efforts to a new domain.

The authors address the lack of standardized evaluation for self-supervised speech models in audio deepfake detection by introducing Spoof-SUPERB, a benchmark that systematically tests 20 models across multiple datasets. They find that large-scale discriminative models such as XLS-R consistently outperform the others and remain robust under acoustic degradations.

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
