AI SD | Aug 24, 2025

ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection

Xin Zhang, Jiaming Chu, Jian Zhao, Yuchu Jiang, Xu Yang, Lei Jin, Chi Zhang, Xuelong Li

arXiv:2508.17282v1 | 5 citations | h-index: 6 | Vicinagearth

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable deepfake detection in multimedia, particularly for security and media-integrity applications. It appears incremental, however, because it builds on existing multimodal approaches.

The paper tackles deepfake detection in audio-visual content by proposing ERF-BA-TFD+, a multimodal model that processes audio and video features jointly to improve detection accuracy and robustness. It reports state-of-the-art results on the DDL-AV dataset and won first place in Track 2 (Audio-Visual Detection and Localization) of the Workshop on Deepfake Detection, Localization, and Interpretability.

Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.
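
The abstract does not say how the enhanced receptive field is obtained. As a rough illustration of why a wider receptive field helps with long-range dependencies, the sketch below computes the temporal receptive field of stacked 1D convolutions, comparing plain layers with dilated ones. Dilated convolution is an assumption chosen here for illustration, not a confirmed detail of ERF-BA-TFD+, and the function name and layer configurations are hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of stacked 1D convolutions.

    layers: list of (kernel_size, dilation, stride) tuples, first layer first.
    Illustrative only; these settings are not taken from the paper.
    """
    rf = 1    # a single output unit initially sees one input frame
    jump = 1  # spacing, in input frames, between adjacent units of the current layer
    for kernel_size, dilation, stride in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Four plain 3-tap layers versus four 3-tap layers with doubling dilation.
plain = [(3, 1, 1)] * 4
dilated = [(3, d, 1) for d in (1, 2, 4, 8)]

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 31
```

With the same number of layers and parameters, the dilated stack sees 31 frames instead of 9, which is one common way a model can relate events that are far apart in time, such as a manipulated audio segment and the corresponding video frames.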
