CV · May 4, 2025

MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution

arXiv:2505.02013v1 · 13 citations · h-index: 31
Originality: Incremental advance
AI Analysis

This paper addresses the threat of deepfake-driven disinformation by improving face forgery detection accuracy, though the advance appears incremental because it builds on existing MLLM methods.

The paper tackles the problem of face forgery detection by proposing VLF-FFD, a vision-language fusion solution that integrates visual and textual modalities more effectively, achieving state-of-the-art performance in cross-dataset and intra-dataset evaluations.

Reliable face forgery detection algorithms are crucial for countering the growing threat of deepfake-driven disinformation. Previous research has demonstrated the potential of Multimodal Large Language Models (MLLMs) in identifying manipulated faces. However, existing methods typically depend on either the Large Language Model (LLM) alone or an external detector to generate classification results, which often leads to sub-optimal integration of visual and textual modalities. In this paper, we propose VLF-FFD, a novel Vision-Language Fusion solution for MLLM-enhanced Face Forgery Detection. Our key contributions are twofold. First, we present EFF++, a frame-level, explainability-driven extension of the widely used FaceForensics++ (FF++) dataset. In EFF++, each manipulated video frame is paired with a textual annotation that describes both the forgery artifacts and the specific manipulation technique applied, enabling more effective and informative MLLM training. Second, we design a Vision-Language Fusion Network (VLF-Net) that promotes bidirectional interaction between visual and textual features, supported by a three-stage training pipeline to fully leverage its potential. VLF-FFD achieves state-of-the-art (SOTA) performance in both cross-dataset and intra-dataset evaluations, underscoring its exceptional effectiveness in face forgery detection.

