MM Apr 10

Generalizing Video DeepFake Detection by Self-generated Audio-Visual Pseudo-Fakes

arXiv:2604.09110 59.3
AI Analysis

This work addresses the challenge of detecting deepfakes in videos for security and media integrity, offering an incremental improvement by generating diverse training data without real deepfakes.

The paper addresses the limited generalizability of video deepfake detectors in real-world scenarios. It proposes training with self-generated audio-visual pseudo-fakes, which yields an average performance improvement of up to 7.4% on multiple standard datasets.

Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
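The abstract does not say which manipulations AVPF applies, so the following is only a minimal sketch of the general idea: building pseudo-fake samples from authentic clips alone. It assumes two hypothetical manipulations, a temporal shift of the audio track and a splice of audio from another authentic clip, and it represents each clip as a pair of per-time-step feature lists. The names `shift_audio`, `splice_audio`, and `make_pseudo_fake` are illustrative and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical clip representation: (video_features, audio_features),
# one value per time step, both tracks non-empty and of equal length.
Clip = Tuple[List[float], List[float]]


def shift_audio(audio: Sequence[float], offset: int) -> List[float]:
    """Circularly shift the audio by `offset` steps, breaking audio-visual sync."""
    n = len(audio)
    if n == 0:
        return []
    k = offset % n
    if k == 0:
        return list(audio)
    return list(audio[-k:]) + list(audio[:-k])


def splice_audio(audio: Sequence[float], donor: Sequence[float],
                 start: int, length: int) -> List[float]:
    """Replace audio[start:start+length] with the same span from a donor clip."""
    out = list(audio)
    end = min(start + length, len(out), len(donor))
    out[start:end] = donor[start:end]
    return out


def make_pseudo_fake(real: Clip, donor: Clip,
                     rng: random.Random) -> Tuple[Clip, int]:
    """Build one pseudo-fake from authentic clips only; returns (clip, label=1)."""
    video, audio = real
    if not audio:
        raise ValueError("audio track must be non-empty")
    if rng.random() < 0.5:
        # Assumed manipulation 1: desynchronise audio from the video.
        offset = rng.randint(1, max(1, len(audio) - 1))
        new_audio = shift_audio(audio, offset)
    else:
        # Assumed manipulation 2: splice in audio from another authentic clip.
        length = max(1, len(audio) // 2)
        start = rng.randint(0, len(audio) - length)
        new_audio = splice_audio(audio, donor[1], start, length)
    # The video track is left untouched; only the correspondence changes.
    return (list(video), new_audio), 1
```

In a training loop under these assumptions, authentic clips would carry label 0 and the outputs of `make_pseudo_fake` label 1, so that no real deepfake is needed in the training set.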

