A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

arXiv:2603.01482v1 · 2 citations · h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the security-critical problem of audio deepfake detection for speech systems by providing a reproducible benchmark. The contribution is incremental, since it extends existing benchmarking efforts to a new domain.

The authors address the lack of standardized evaluation for self-supervised speech models in audio deepfake detection by introducing Spoof-SUPERB, a benchmark that systematically tests 20 models across multiple datasets. They find that large-scale discriminative models such as XLS-R consistently outperform the others and remain robust under acoustic degradations.

Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.

