CV · AI · May 19, 2025

Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection

arXiv:2505.12966v1 · 1 citation · h-index: 1 · ICMR
Originality: Highly original
AI Analysis

This paper addresses the erosion of multimedia credibility by deepfakes, a problem that matters for security and media verification applications, and it reports a strong, specific gain in multimodal detection.

The paper tackles unbalanced learning between modalities in multimodal deepfake detection. It proposes an audio-visual joint learning method that mitigates modality conflicts and neglect through contrastive learning and an orthogonalization-multimodal Pareto module. The model achieves an average accuracy of 95.5% across multiple datasets. When trained on DFDC and tested on DefakeAVMiT and FakeAVCeleb, it improves accuracy by 8.0 and 7.7 percentage points (absolute) over the previous best-performing approach.
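To make the contrastive-learning idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss that pulls together audio and video embeddings from the same clip and pushes apart embeddings from different clips. This is a generic formulation, not the loss defined in the paper; the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where candidate i is the positive for anchor i."""
    anchors = [l2_normalize(a) for a in anchors]
    candidates = [l2_normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def audio_visual_contrastive_loss(audio, video, temperature=0.1):
    """Hypothetical helper: average the audio-to-video and video-to-audio terms."""
    return 0.5 * (
        info_nce(audio, video, temperature)
        + info_nce(video, audio, temperature)
    )


# Toy batch of two clips with 2-D embeddings (illustrative values only).
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[1.0, 0.0], [0.0, 1.0]]
video_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = audio_visual_contrastive_loss(audio, video_aligned)
loss_swapped = audio_visual_contrastive_loss(audio, video_swapped)
```

In this toy batch the aligned pairing yields a much smaller loss than the swapped pairing, which is the behaviour a cross-modal contrastive objective is meant to reward.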

Advances in computer vision and deep learning have blurred the line between deepfakes and authentic media, undermining multimedia credibility through audio-visual forgery. Current multimodal detection methods remain limited by unbalanced learning between modalities. To tackle this issue, we propose an Audio-Visual Joint Learning Method (MACB-DF) that mitigates modality conflicts and neglect by using contrastive learning to assist multi-level and cross-modal fusion, thereby balancing and fully exploiting information from each modality. Additionally, we design an orthogonalization-multimodal Pareto module that preserves unimodal information while addressing gradient conflicts in the audio-video encoders caused by the differing optimization targets of the loss functions. Extensive experiments and ablation studies on mainstream deepfake datasets demonstrate consistent performance gains across key evaluation metrics, with an average accuracy of 95.5% across multiple datasets. Notably, our method exhibits superior cross-dataset generalization, with absolute improvements of 8.0% and 7.7% in accuracy (ACC) over the previous best-performing approach when trained on DFDC and tested on the DefakeAVMiT and FakeAVCeleb datasets.
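The abstract does not spell out how the orthogonalization-multimodal Pareto module works, so the sketch below illustrates only the general problem it names: two losses whose gradients on a shared encoder point in conflicting directions. It shows two well-known stand-ins, not the paper's module: a PCGrad-style orthogonal projection, and the closed-form two-task min-norm combination used in MGDA-style multi-objective optimization. All function names and the toy gradients are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(g, other):
    """PCGrad-style step: if g conflicts with `other` (negative dot product),
    remove the component of g that lies along `other`."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict, leave g unchanged
    denom = dot(other, other)
    return [x - (d / denom) * y for x, y in zip(g, other)]


def min_norm_combination(g1, g2):
    """Two-task min-norm point: the convex combination
    alpha * g1 + (1 - alpha) * g2 with the smallest Euclidean norm."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0:
        return list(g1)  # identical gradients
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = min(1.0, max(0.0, alpha))  # clip to the [0, 1] segment
    return [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]


# Toy gradients from two hypothetical losses on shared encoder parameters.
g_audio = [1.0, 0.0]
g_video = [-1.0, 1.0]  # conflicts with g_audio: dot product is -1

g_audio_proj = project_out_conflict(g_audio, g_video)
g_shared = min_norm_combination(g_audio, g_video)
```

After projection, `g_audio_proj` is orthogonal to `g_video`, so following it no longer works against the video objective to first order. The min-norm combination `g_shared` has a non-negative dot product with both original gradients, which is the property that makes it a common descent direction in Pareto-style formulations.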

