SDCVASIVAug 4, 2025

Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

arXiv:2508.02000v14 citationsh-index: 13
Originality Highly original
AI Analysis

This work addresses the problem of detecting subtle deepfakes in multimedia for security and verification applications, representing an incremental improvement with a novel method for a known bottleneck.

The paper tackles the challenge of localizing audio-visual deepfakes in videos with partial manipulations, where fake regions span only a few frames, by proposing HBMNet, which integrates audio-visual encoding, multi-scale temporal cues, and bidirectional boundary modeling to outperform existing methods like BA-TFD and UMMAFormer.

Audio-visual temporal deepfake localization under the content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions are usually only spanning a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues and bidirectional boundary-content relationships. Experiments show that encoding and fusion primarily improve precision, while frame-level supervision boosts recall. Each module (audio-visual fusion, temporal scales, bi-directionality) contributes complementary benefits, collectively enhancing localization performance. HBMNet outperforms BA-TFD and UMMAFormer and shows improved potential scalability with more training data.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes