How Close are Other Computer Vision Tasks to Deepfake Detection?
This work aims to improve deepfake detection for researchers and practitioners by identifying which pre-trained models make better starting points, though its contribution is incremental in that it benchmarks existing methods.
The paper challenges the assumption that ImageNet-trained models generalize well for deepfake detection, finding that self-supervised models are more effective at separating data but risk overfitting after fine-tuning.
In this paper, we challenge the conventional belief that supervised ImageNet-trained models have strong generalizability and are suitable for use as feature extractors in deepfake detection. We present a new measurement, "model separability," for visually and quantitatively assessing a model's raw capacity to separate data in an unsupervised manner. We also present a systematic benchmark that uses pre-trained models to determine the correlation between deepfake detection and other computer vision tasks. Our analysis shows that pre-trained face recognition models are more closely related to deepfake detection than other models. Additionally, models trained using self-supervised methods separate the data more effectively than those trained using supervised methods. After fine-tuning all models on a small deepfake dataset, we find that self-supervised models deliver the best results, but that they carry a risk of overfitting. Our results provide insights that should help researchers and practitioners develop more effective deepfake detection models.
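The abstract does not say how "model separability" is computed, so the following is only a minimal sketch of one way such a measurement could look, not the paper's definition. It assumes that features come from a frozen pre-trained model, that they are clustered without labels (plain 2-means here), and that the real/fake labels are used only afterwards to score the clusters. The function names `two_means` and `separability_score` are hypothetical, and the random arrays are synthetic stand-ins for model embeddings.

```python
import numpy as np


def two_means(features, n_iter=50):
    """Plain 2-means (Lloyd's algorithm); returns a 0/1 cluster id per row."""
    # Deterministic init: the first point, then the point farthest from it.
    c0 = features[0]
    c1 = features[np.argmax(np.linalg.norm(features - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    assign = None
    for _ in range(n_iter):
        # Distance from every sample to both centers, shape (n_samples, 2).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = features[assign == k].mean(axis=0)
    return assign


def separability_score(features, labels):
    """Clustering accuracy in [0.5, 1.0]; labels are used for scoring only."""
    assign = two_means(np.asarray(features, dtype=float))
    agreement = float(np.mean(assign == np.asarray(labels)))
    # Cluster ids are arbitrary, so take the better of the two matchings.
    return max(agreement, 1.0 - agreement)


rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)  # 0 = real, 1 = fake (synthetic)
# Stand-in for embeddings that separate real from fake well.
separable = rng.normal(size=(200, 16)) + 10.0 * labels[:, None]
# Stand-in for embeddings that carry no real/fake signal.
uninformative = rng.normal(size=(200, 16))

print("separable:", separability_score(separable, labels))
print("uninformative:", separability_score(uninformative, labels))
```

Under these assumptions, a score near 1.0 would indicate that the raw features already split real from fake without supervision, while a score near 0.5 would indicate no usable structure. In practice the synthetic arrays would be replaced by embeddings extracted from each pre-trained model being compared.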