Audio-Visual Deepfake Detection With Local Temporal Inconsistencies
This work addresses the detection of manipulated media for security and verification purposes, but its contribution is incremental, as it builds on existing detection methods.
The paper tackles audio-visual deepfake detection by capturing fine-grained temporal inconsistencies between the audio and visual modalities, and reports that the approach is effective on the DFDC and FakeAVCeleb datasets.
This paper proposes an audio-visual deepfake detection approach that aims to capture fine-grained temporal inconsistencies between audio and visual modalities. To achieve this, both architectural and data synthesis strategies are introduced. From an architectural perspective, a temporal distance map, coupled with an attention mechanism, is designed to capture these inconsistencies while minimizing the impact of irrelevant temporal subsequences. Moreover, we explore novel pseudo-fake generation techniques to synthesize local inconsistencies. Our approach is evaluated against state-of-the-art methods using the DFDC and FakeAVCeleb datasets, demonstrating its effectiveness in detecting audio-visual deepfakes.
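The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes the temporal distance map is a matrix of pairwise Euclidean distances between per-frame audio and visual feature vectors. It replaces the paper's attention mechanism with a parameter-free softmax over the time-aligned distances. It builds a pseudo-fake by replacing a short audio subsequence with frames taken from elsewhere in the same clip. All function names and parameters (`temporal_distance_map`, `attention_pooled_score`, `make_pseudo_fake`, `temperature`, `shift`) are hypothetical and do not come from the paper.

```python
import math


def temporal_distance_map(audio, video):
    """Pairwise Euclidean distances between audio and visual frame features.

    audio, video: lists of feature vectors (lists of floats) of equal dimension.
    Returns a nested list D with D[i][j] = ||audio[i] - video[j]||.
    """
    return [[math.dist(a, v) for v in video] for a in audio]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pooled_score(dist_map, temperature=1.0):
    """Pool the time-aligned (diagonal) distances with softmax weights.

    Assumption: a parameter-free softmax stands in for a learned attention
    module. Time steps with larger audio-visual distance get more weight,
    so well-synchronised subsequences contribute little to the score.
    """
    steps = min(len(dist_map), len(dist_map[0]))
    diagonal = [dist_map[t][t] for t in range(steps)]
    weights = softmax([d / temperature for d in diagonal])
    score = sum(w * d for w, d in zip(weights, diagonal))
    return score, weights


def make_pseudo_fake(audio, start, length, shift):
    """Create a local inconsistency in a copy of the audio features.

    Assumption: frames in [start, start + length) are replaced by frames
    located `shift` steps away (wrapping around), leaving the rest intact.
    """
    total = len(audio)
    fake = [list(frame) for frame in audio]
    for t in range(start, min(start + length, total)):
        fake[t] = list(audio[(t + shift) % total])
    return fake


# Toy example: perfectly synchronised 2-D features for 6 time steps.
video = [[float(t), 0.0] for t in range(6)]
audio = [[float(t), 0.0] for t in range(6)]

real_score, _ = attention_pooled_score(temporal_distance_map(audio, video))
fake_audio = make_pseudo_fake(audio, start=2, length=2, shift=3)
fake_score, fake_weights = attention_pooled_score(
    temporal_distance_map(fake_audio, video)
)
print(real_score, fake_score)
```

In this toy case only two of the six time steps are altered, yet the softmax weights concentrate on them, so the pooled score for the pseudo-fake is larger than for the synchronised clip. A real system would use learned audio and visual encoders and a trained attention module in place of these hand-written stand-ins.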