HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang, Christopher Leckie

arXiv:2602.01032v12.2

Originality: Incremental advance

AI Analysis

This work targets security and trust risks in audio verification by improving the detection of synthetic speech. The advance is incremental because it builds on existing self-supervised models.

The paper addresses audio deepfake detection by modelling the temporal and hierarchical dependencies that existing detectors overlook. It reports state-of-the-art equal error rates (EER) of 1.93% on ASVspoof 2021 DF and 6.87% on In-the-Wild, improvements of 36.6% and 22.5% respectively over independent layer weighting.
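For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate is commonly computed from detector scores. It is not the paper's evaluation script. The function names `compute_eer` and `relative_reduction` are hypothetical, and the sketch assumes that a higher score means "more likely bona fide" and that the reported improvements are relative EER reductions, which this page does not state explicitly.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) from a simple threshold sweep.

    Assumption: higher score = more likely bona fide speech.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # False rejection rate: bona fide trials scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoof trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented purely for illustration.
eer, threshold = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.5, 0.3])
```

The threshold sweep is quadratic in the number of trials, which is fine for a toy example; production scoring tools usually sort the scores once and interpolate between operating points.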

Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
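The abstract names three levels of dependency (temporal frames, neighbouring layers, layer groups) and a margin-based contrastive objective, but this page does not give the formulation. The sketch below is one plausible reading rather than the authors' implementation: attention pooling is applied over frames, then over layers within fixed-size groups, then over groups, and the loss is the classic pairwise margin contrastive loss. All names, shapes, the grouping scheme, and the single scoring vector per level are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_pool(x, w):
    """Pool x of shape (..., n, d) over the n axis with scoring vector w (d,)."""
    weights = softmax(x @ w, axis=-1)             # (..., n)
    return (weights[..., None] * x).sum(axis=-2)  # (..., d)


def hierarchical_pool(hidden, group_size, w_time, w_layer, w_group):
    """Map hidden states (num_layers, num_frames, dim) to one (dim,) embedding."""
    num_layers, _, dim = hidden.shape
    assert num_layers % group_size == 0, "layers must split evenly into groups"
    # Level 1: attend over temporal frames within each layer.
    per_layer = attention_pool(hidden, w_time)                 # (L, D)
    # Level 2: attend over neighbouring layers within each group.
    grouped = per_layer.reshape(num_layers // group_size, group_size, dim)
    per_group = attention_pool(grouped, w_layer)               # (G, D)
    # Level 3: attend over layer groups.
    return attention_pool(per_group, w_group)                  # (D,)


def margin_contrastive_loss(emb, labels, margin=1.0):
    """Pairwise margin contrastive loss on L2-normalised embeddings.

    Same-label pairs are pulled together; different-label pairs are pushed
    at least `margin` apart. The margin value is an assumption.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = (dist ** 2)[same & off_diag]
    neg = (np.maximum(0.0, margin - dist) ** 2)[~same]
    return (pos.sum() + neg.sum()) / off_diag.sum()


# Random stand-ins for self-supervised hidden states: 4 utterances,
# 24 layers, 50 frames, 8 dimensions (all sizes are illustrative).
rng = np.random.default_rng(0)
dim, layers, frames = 8, 24, 50
w_time, w_layer, w_group = (rng.normal(size=dim) for _ in range(3))
batch = rng.normal(size=(4, layers, frames, dim))
emb = np.stack([hierarchical_pool(h, 4, w_time, w_layer, w_group) for h in batch])
loss = margin_contrastive_loss(emb, np.array([0, 0, 1, 1]))  # 0 = bona fide, 1 = spoof
```

In a trained system the scoring vectors would be learned jointly with a classifier, and the contrastive term would be added to the classification loss; the random weights here only demonstrate the shapes involved.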
