CV, MM | May 29, 2020

Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

arXiv:2005.14405v3 | 239 citations
AI Analysis

This paper addresses the detection of manipulated videos, a problem that matters for security and media integrity. The contribution is incremental, since it builds on existing modality-based detection approaches.

The paper tackles deepfake video detection by measuring audio-visual dissimilarity, reporting up to 7% improvement over state-of-the-art methods on the DFDC and DeepFake-TIMIT datasets. It also demonstrates temporal forgery localization.

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as an aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
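The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that MDS is "an aggregate of dissimilarity scores", so the Euclidean distance, the mean as the aggregate, the margin-based form of the contrastive loss, and the threshold used for localization are all assumptions made here for illustration. The function names are hypothetical, and the per-chunk feature vectors are assumed to come from audio and visual networks that are not shown.

```python
import numpy as np


def chunk_distances(audio_feats, visual_feats):
    """Per-chunk Euclidean distance between audio and visual embeddings.

    Both inputs have shape (num_chunks, feature_dim). Euclidean distance is
    an assumed choice of dissimilarity measure.
    """
    audio_feats = np.asarray(audio_feats, dtype=float)
    visual_feats = np.asarray(visual_feats, dtype=float)
    if audio_feats.shape != visual_feats.shape:
        raise ValueError("audio and visual features must have the same shape")
    return np.linalg.norm(audio_feats - visual_feats, axis=1)


def modality_dissonance_score(audio_feats, visual_feats):
    """Aggregate the chunk-level distances into one video-level score.

    The mean is an assumed aggregate; a higher score suggests a fake video.
    """
    return float(chunk_distances(audio_feats, visual_feats).mean())


def contrastive_loss(distances, is_real, margin=1.0):
    """Margin-based contrastive loss over chunk distances (assumed form).

    Real chunks (is_real == 1) are pulled together; fake chunks
    (is_real == 0) are pushed at least `margin` apart.
    """
    distances = np.asarray(distances, dtype=float)
    is_real = np.asarray(is_real, dtype=float)
    pull = is_real * distances ** 2
    push = (1.0 - is_real) * np.maximum(margin - distances, 0.0) ** 2
    return float(np.mean(pull + push))


def localize_forgery(distances, threshold):
    """Indices of chunks whose distance exceeds a hypothetical threshold."""
    return np.flatnonzero(np.asarray(distances, dtype=float) > threshold)


if __name__ == "__main__":
    # Toy features: three chunks, the last one with mismatched modalities.
    audio = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    visual = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
    d = chunk_distances(audio, visual)
    print("chunk distances:", d)
    print("MDS:", modality_dissonance_score(audio, visual))
    print("suspect chunks:", localize_forgery(d, threshold=1.0))
```

In this sketch the same chunk-level distances serve both purposes described in the abstract: their aggregate gives a video-level decision, and the individual values point to the segments that are most likely manipulated.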

Code Implementations: 1 repo
