CV · MM · Apr 9

MSCT: Differential Cross-Modal Attention for Deepfake Detection

arXiv:2604.07741 · 18.8
AI Analysis

This paper addresses deepfake detection for multimedia security, but it appears incremental, as it builds on existing multi-modal methods.

The paper tackles insufficient feature extraction and modal alignment deviation in audio-visual deepfake detection by proposing a multi-scale cross-modal transformer encoder (MSCT), which achieves competitive performance on the FakeAVCeleb dataset.

Audio-visual deepfake detection typically employs complementary multi-modal models to detect forgery traces in a video. These methods primarily extract forgery traces by checking audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention module to integrate the features of adjacent embeddings and a differential cross-modal attention module to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
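The abstract names two components but does not give their formulas, so the following is only a rough sketch of one plausible reading, not the paper's implementation. It assumes that "multi-scale" means averaging each embedding with its neighbours over several window sizes, and that "differential" means subtracting a second softmax attention map from the first, as in differential attention. The function names, the window sizes, the weight `lam`, and the trick of splitting each embedding in half in place of learned projections are all hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain list-of-lists matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    # Scaled dot-product attention weights: one softmax row per query.
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def multi_scale_pool(seq, windows=(1, 3)):
    # ASSUMPTION: "multi-scale" is modelled as the mean of local averages
    # taken over several window sizes (windows are clipped at the borders).
    n, d = len(seq), len(seq[0])
    out = [[0.0] * d for _ in range(n)]
    for w in windows:
        r = w // 2
        for i in range(n):
            lo, hi = max(0, i - r), min(n, i + r + 1)
            for j in range(d):
                mean = sum(seq[t][j] for t in range(lo, hi)) / (hi - lo)
                out[i][j] += mean / len(windows)
    return out

def differential_cross_attention(video, audio, lam=0.5):
    # ASSUMPTION: video embeddings act as queries, audio embeddings as keys
    # and values. Each embedding is split in half to form two query/key
    # groups (a stand-in for learned projections), and the second attention
    # map is subtracted from the first with weight lam.
    half = len(video[0]) // 2
    q1, q2 = [v[:half] for v in video], [v[half:] for v in video]
    k1, k2 = [a[:half] for a in audio], [a[half:] for a in audio]
    a1, a2 = attention_map(q1, k1), attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, audio)

# Toy inputs: 2 video frames and 3 audio frames, embedding size 4.
video = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
audio = [[1.0, 2.0, 1.0, 2.0], [3.0, 4.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
fused = differential_cross_attention(multi_scale_pool(video), multi_scale_pool(audio))
```

In this reading, the subtraction cancels attention that both maps assign equally, so only the cross-modal correspondences on which the two maps disagree contribute to the fused features. With `lam=0` the function reduces to ordinary cross-attention from video to audio.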

