CVNov 15, 2024

DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

arXiv:2411.10193v214.112 citationsh-index: 14Has Code

Originality Incremental advance

AI Analysis

This addresses the threat of sophisticated audio-visual deepfakes for online security and trust, offering an incremental advance over existing detection methods.

The paper tackles the problem of detecting and localizing deepfakes that manipulate both audio and visual modalities, presenting DiMoDif, a framework that uses cross-modal incongruities to achieve state-of-the-art performance, such as a 30.5 AUC improvement on AV-Deepfake1M and 47.88 AP@0.75 on temporal localization.

Deepfake technology has rapidly advanced and poses significant threats to information integrity and trust in online multimedia. While significant progress has been made in detecting deepfakes, the simultaneous manipulation of audio and visual modalities, sometimes at small parts or in subtle ways, presents highly challenging detection scenarios. To address these challenges, we present DiMoDif, an audio-visual deepfake detection framework that leverages the inter-modality differences in machine perception of speech, based on the assumption that in real samples -- in contrast to deepfakes -- visual and audio signals coincide in terms of information. DiMoDif leverages features from deep networks that specialize in visual and audio speech recognition to spot frame-level cross-modal incongruities, and in that way to temporally localize the deepfake forgery. To this end, we devise a hierarchical cross-modal fusion network, integrating adaptive temporal alignment modules and a learned discrepancy mapping layer to explicitly model the subtle differences between visual and audio representations. Then, the detection model is optimized through a composite loss function accounting for frame-level detections and fake intervals localization. DiMoDif outperforms the state-of-the-art on the Deepfake Detection task by 30.5 AUC on the highly challenging AV-Deepfake1M, while it performs exceptionally on FakeAVCeleb and LAV-DF. On the Temporal Forgery Localization task, it outperforms the state-of-the-art by 47.88 AP@0.75 on AV-Deepfake1M, and performs on-par on LAV-DF. Code available at https://github.com/mever-team/dimodif.

View on arXiv PDF Code

Similar