SDLG · Dec 17, 2025

BEAT2AASIST model with layer fusion for ESDD 2026 Challenge

arXiv:2512.15180v1 · 3 citations · h-index: 7
Originality: Synthesis-oriented
AI Analysis

This addresses the risk of manipulated environmental sounds for audio security applications, but appears incremental because it builds on the existing BEATs-AASIST architecture.

The authors tackled environmental sound deepfake detection for the ESDD 2026 Challenge by proposing BEAT2AASIST, which extends BEATs-AASIST with dual-branch processing, layer fusion strategies, and data augmentation, and achieves competitive performance on the official test sets.

Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST, which extends BEATs-AASIST by splitting BEATs-derived representations along the frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
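The abstract names the feature operations but not their exact form. The NumPy sketch below shows one plausible reading, under stated assumptions: each BEATs layer output has been reshaped into a (time, frequency, channel) grid; "top-k" means the last k transformer layers; the SE-gated variant learns one sigmoid gate per layer and takes a gated sum; and the frequency or channel split simply halves the corresponding axis so that each half can feed one AASIST branch. All function names are hypothetical and are not taken from the authors' code. The weights `w1` and `w2` are random stand-ins for learned parameters. The CNN-gated fusion, the AASIST branches, and the vocoder augmentation are omitted.

```python
import numpy as np

def top_k_layers(hidden_states, k):
    """Stack the last k layer outputs (assumed meaning of 'top-k') -> (k, T, F, D)."""
    return np.stack(hidden_states[-k:], axis=0)

def concat_fusion(layers):
    """Concatenation fusion: join the k layers along the channel axis -> (T, F, k*D)."""
    return np.concatenate(list(layers), axis=-1)

def se_gated_fusion(layers, w1, w2):
    """Assumed SE-style gating over the layer axis, followed by a gated sum -> (T, F, D)."""
    k = layers.shape[0]
    squeeze = layers.reshape(k, -1).mean(axis=1)     # (k,) one global descriptor per layer
    hidden = np.maximum(w1 @ squeeze, 0.0)           # (r,) ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # (k,) sigmoid gate per layer
    return np.tensordot(gates, layers, axes=(0, 0))  # sum_k gates[k] * layers[k]

def split_features(x, mode):
    """Halve a (T, F, D) map along frequency or channel for the two branches."""
    axis = {"frequency": 1, "channel": 2}[mode]
    return np.split(x, [x.shape[axis] // 2], axis=axis)

# Toy example: 12 layers of a 768-dim encoder, 50 time steps, 8 frequency patches.
rng = np.random.default_rng(0)
T, F, D, n_layers, k, r = 50, 8, 768, 12, 4, 2
hidden_states = [rng.standard_normal((T, F, D)) for _ in range(n_layers)]

layers = top_k_layers(hidden_states, k)              # (4, 50, 8, 768)
fused_cat = concat_fusion(layers)                    # (50, 8, 3072)
w1 = rng.standard_normal((r, k))                     # stand-in for learned weights
w2 = rng.standard_normal((k, r))
fused_se = se_gated_fusion(layers, w1, w2)           # (50, 8, 768)

low, high = split_features(fused_se, "frequency")    # two (50, 4, 768) halves
left, right = split_features(fused_se, "channel")    # two (50, 8, 384) halves
```

In this reading, concatenation keeps every layer's information but multiplies the channel width by k, whereas the gated variant keeps the width at D and lets the model reweight layers. That trade-off is one plausible reason to compare both strategies.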
