Reduced Spatial Dependency for More General Video-level Deepfake Detection
This work addresses the safety concerns raised by deepfake video, offering a detection method with better generalization, though it appears incremental because it builds on existing temporal-consistency approaches.
The paper tackles spatial bias, which hinders generalization in video-level deepfake detection. It proposes Spatial Dependency Reduction (SDR), which integrates temporal consistency features from spatially-perturbed clusters, and reports improved performance on extensive benchmarks.
As one of the most prominent forms of AI-generated content, deepfakes have raised significant safety concerns. Although temporal consistency cues have been shown to generalize better, existing CNN-based methods inevitably introduce spatial bias, which hinders the extraction of intrinsic temporal features. To address this issue, we propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially-perturbed clusters to reduce the model's dependency on spatial information. Specifically, we design multiple Spatial Perturbation Branches (SPBs) to construct spatially-perturbed feature clusters. We then draw on mutual information theory and propose a Task-Relevant Feature Integration (TRFI) module to capture temporal features that reside in a similar latent space across these clusters. Finally, the integrated feature is fed into a temporal transformer to capture long-range dependencies. Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
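
The pipeline described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how the SPBs perturb their input or how TRFI uses mutual information, so the patch shuffling, the position-weighted feature extractor, and the plain averaging step below are all assumptions chosen for illustration, and every function name is hypothetical. The temporal transformer stage is omitted.

```python
import random


def perturb_patches(video, perm):
    """Reorder the patches of every frame with one shared permutation.

    video: list of frames, each frame a list of patch values (floats).
    Sharing the permutation across frames changes the spatial layout
    while leaving each patch's evolution over time intact.
    (Assumed perturbation; the abstract does not define the SPB.)
    """
    return [[frame[i] for i in perm] for frame in video]


def biased_temporal_feature(video):
    """Position-weighted frame-to-frame change, one value per time step.

    The position-dependent weights mimic a spatially biased extractor:
    the same motion scores differently depending on where it occurs.
    """
    n = len(video[0])
    total = n * (n + 1) / 2
    weights = [(i + 1) / total for i in range(n)]
    return [
        sum(w * abs(c - p) for w, p, c in zip(weights, prev, cur))
        for prev, cur in zip(video, video[1:])
    ]


def integrate(features):
    """Average the per-branch features element-wise.

    A simple stand-in for TRFI, which the abstract says relies on
    mutual information; averaging is NOT the paper's module.
    """
    return [sum(vals) / len(features) for vals in zip(*features)]


def sdr_sketch(video, num_branches=4, seed=0):
    """Run several perturbation branches and integrate their features."""
    rng = random.Random(seed)
    perm = list(range(len(video[0])))
    features = []
    for _ in range(num_branches):
        rng.shuffle(perm)  # a fresh spatial layout for this branch
        features.append(biased_temporal_feature(perturb_patches(video, perm)))
    return integrate(features)
```

Calling `sdr_sketch(video)` on a clip of T frames returns T-1 values. In this toy setting, averaging over more branches makes the result depend less on which patch positions the extractor happens to weight most, which is the intuition the abstract gives for reducing spatial dependency.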