SD · MM · May 22

MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

arXiv:2605.23201 · 47.7 · Has Code
AI Analysis

For researchers and practitioners in audio deepfake detection, this work addresses the real-world challenge of mixed audio, where existing SSL-based methods fail.

The paper introduces MixFake, a benchmark for audio deepfake detection in mixed audio (speech with background music or noise). It also proposes a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL models, achieving 0.95% EER in foreground detection and a 7.72% absolute improvement in complex background detection.
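
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The following is a minimal sketch of one common way to estimate it from detector scores. The function name `compute_eer`, the score convention (higher means more likely bona fide), and the threshold sweep are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def compute_eer(scores, labels):
    """Estimate the equal error rate from detector scores.

    Assumed convention (hypothetical, not from the paper):
    labels are 1 for bona fide and 0 for spoof, and a higher
    score means the detector leans towards bona fide.
    Returns (eer, threshold).
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=int)
    bona_fide = scores[labels == 1]
    spoof = scores[labels == 0]
    if bona_fide.size == 0 or spoof.size == 0:
        raise ValueError("need at least one bona fide and one spoof score")

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A sample is accepted as bona fide when score >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])     # spoof accepted
    frr = np.array([(bona_fide < t).mean() for t in thresholds])  # bona fide rejected

    # Pick the threshold where the two error rates are closest and average them.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]
```

With finite data the two error curves rarely cross exactly at an observed threshold, so this sketch reports the average of the two rates at the closest point; evaluation toolkits often interpolate instead.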

Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
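
The abstract does not describe how the mixtures are constructed, so the following is only a sketch of the standard way to combine a foreground speech signal with background music or noise at a target SNR. The function name `mix_at_snr`, the choice to loop the background when it is shorter than the speech, and the use of mean power over the whole clip are all assumptions made for illustration; the MixFake pipeline may differ.

```python
import numpy as np

def mix_at_snr(speech, background, snr_db):
    """Add `background` to `speech` so that the speech-to-background
    power ratio equals `snr_db` decibels.

    Hypothetical helper for illustration; returns (mixture, gain), where
    `gain` is the scale factor applied to the background.
    """
    speech = np.asarray(speech, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)

    # Loop the background if it is too short, then trim to the speech length.
    if len(background) < len(speech):
        reps = int(np.ceil(len(speech) / len(background)))
        background = np.tile(background, reps)
    background = background[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_background = np.mean(background ** 2)
    if p_speech == 0 or p_background == 0:
        raise ValueError("both signals must have non-zero power")

    # Solve 10 * log10(p_speech / (gain**2 * p_background)) = snr_db for gain.
    gain = np.sqrt(p_speech / (p_background * 10.0 ** (snr_db / 10.0)))
    return speech + gain * background, gain
```

In a benchmark with mixed authenticity components, either input could independently be genuine or synthetic, for example `mix_at_snr(fake_speech, real_music, snr_db=5.0)`, where the variable names are placeholders.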
