SD · AI · May 5

Deepfake Audio Detection Using Self-supervised Fusion Representations

arXiv:2605.03420 · 30.9
Predicted impact: top 75% in SD · last 90 days
Originality: Incremental advance
AI Analysis

For researchers in audio deepfake detection, this work addresses the specific challenge of detecting independent manipulation of speech and environmental sounds, but the improvement over baseline is modest.

The paper proposes a dual-branch deepfake detection framework using self-supervised fusion representations (XLS-R and BEATs) to detect component-level manipulation of speech and environmental sounds. On the CompSpoofV2 test set, it achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline.

This paper describes a submission to the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) 2026, which addresses component-level deepfake detection using the CompSpoofV2 dataset, where speech and environmental sounds may be independently manipulated. To address this challenge, a dual-branch deepfake detection framework is proposed to jointly model speech and environmental contextual representations from input audio. Two pretrained models, XLS-R for speech and BEATs for environmental sound, are used to extract complementary contextual representations. A Matching Head is introduced to model representation differences through statistical normalization and representation interaction, enabling estimation of the original class. In parallel, multi-head cross-attention enables effective information exchange between speech and environmental components. The refined representations are processed with residual connections and layer normalization, and passed to an AASIST classifier to predict speech-based and environment-based spoofing probabilities. The model outputs original, speech, and environment predictions. On the test set, the proposed system achieves an F1-score of 70.20% and an environmental EER of 16.54%, outperforming the baseline system.
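The equal error rate (EER) reported above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how such a figure can be computed from per-utterance scores. It is not the challenge's official scoring script: the function name `equal_error_rate`, the convention that a higher score means "more likely bona fide", and the simple threshold sweep are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    bonafide_scores: Sequence[float],
    spoof_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (eer, threshold) for two sets of detection scores.

    Assumed convention: a higher score means "more likely bona fide",
    and an utterance is accepted when score >= threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))

    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = thresholds[0]

    for threshold in thresholds:
        # False acceptance: a spoofed utterance is accepted as bona fide.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: a bona fide utterance is rejected.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)

        # Keep the threshold where the two error rates are closest;
        # on finite data they rarely match exactly, so report their mean.
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2
            best_threshold = threshold

    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy scores, invented purely to exercise the function.
    bonafide = [0.9, 0.6, 0.4, 0.8]
    spoof = [0.1, 0.5, 0.7, 0.2]
    eer, thr = equal_error_rate(bonafide, spoof)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

Under this convention, an environmental EER would be obtained by passing the environment-branch scores of genuine and manipulated environmental sounds. The sweep is quadratic in the number of scores, which is fine for a sketch; a production scorer would typically sort once and accumulate counts.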

