CV · AI · LG · IV · Apr 19

Towards Generalizable Deepfake Image Detection with Vision Transformers

arXiv:2604.17376 · 26.1 · h-index: 15
AI Analysis

For practitioners needing robust deepfake detection, this work provides a generalizable method that significantly improves over existing approaches on a challenging benchmark.

The paper addresses the challenge of generalizable deepfake image detection by using an ensemble of fine-tuned vision transformers (DINOv2, AIMv2, OpenCLIP ViT-L/14). The ensemble achieves 96.77% AUC and 9% EER on the DF-Wild test set, outperforming the prior state-of-the-art Effort by 7.05% AUC and 8% EER.
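The two metrics quoted above can be made concrete with a small sketch. It assumes the convention that label 1 means "fake" and that a higher score means "more likely fake"; the function names `roc_auc` and `equal_error_rate` and the toy data are illustrative and do not come from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random fake outscores a random real image.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def equal_error_rate(labels, scores):
    """Approximate EER: sweep thresholds and take the point where the
    false positive rate and false negative rate are closest."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        fnr = sum(s < t for s in pos) / len(pos)   # fakes scored below the threshold
        fpr = sum(s >= t for s in neg) / len(neg)  # reals scored at or above it
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy data (hypothetical): 4 fake images (label 1) and 4 real images (label 0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

print(roc_auc(labels, scores))           # 0.875
print(equal_error_rate(labels, scores))  # 0.25
```

A higher AUC and a lower EER both indicate better separation of real and fake images, which is why the comparison with Effort reports a gain in AUC alongside a reduction in EER.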

Detecting deepfake images remains challenging because modern generative models evolve quickly and existing detection methods generalize poorly. In this paper, we use an ensemble of fine-tuned vision transformers (DINOv2, AIMv2, and OpenCLIP's ViT-L/14) to build a generalizable method for detecting deepfakes. We use the DF-Wild dataset, released as part of the IEEE SP Cup 2025, because it covers a challenging and diverse set of manipulations and generation techniques. We started our experiments with CNN classifiers trained on spatial features. Experimental results show that our ensemble outperforms individual models and strong CNN baselines, achieving an AUC of 96.77% and an Equal Error Rate (EER) of just 9% on the DF-Wild test set, beating the state-of-the-art deepfake detection algorithm Effort by 7.05% and 8% in AUC and EER, respectively. This was the winning solution for the SP Cup, presented at ICASSP 2025.
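The abstract does not say how the individual models' outputs are combined, so the following is only a minimal sketch of one common choice: averaging each model's predicted "fake" probability per image, optionally with weights. The function name `ensemble_scores`, the model keys, and all score values are hypothetical.

```python
def ensemble_scores(per_model_scores, weights=None):
    """Fuse per-model fake probabilities into one score per image.

    per_model_scores maps a model name to a list of probabilities, one per
    image, in the same image order for every model. With weights=None this
    is a plain average; otherwise it is a weighted average.
    """
    names = list(per_model_scores)
    n_images = len(per_model_scores[names[0]])
    if any(len(per_model_scores[m]) != n_images for m in names):
        raise ValueError("every model must score the same number of images")
    if weights is None:
        weights = {m: 1.0 for m in names}
    total = sum(weights[m] for m in names)
    return [
        sum(weights[m] * per_model_scores[m][i] for m in names) / total
        for i in range(n_images)
    ]


# Hypothetical fake probabilities for three images from three fine-tuned models.
model_outputs = {
    "dinov2": [0.9, 0.2, 0.6],
    "aimv2": [0.8, 0.1, 0.4],
    "openclip_vit_l14": [0.7, 0.3, 0.8],
}

fused = ensemble_scores(model_outputs)
# Assumed decision rule: flag an image as fake when the fused score is at least 0.5.
is_fake = [score >= 0.5 for score in fused]
print(is_fake)  # [True, False, True]
```

Averaging probabilities keeps the fused output on the same scale as each model's score, so threshold-based metrics such as AUC and EER can be computed on it directly. Other fusion schemes, such as learned weights or stacking, would fit the same interface.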

