CV · May 8

Towards multi-modal forgery representation learning for AI-generated video detection and localization

arXiv:2605.07232
AI Analysis

For security and media forensics, this work addresses the need for reliable detection of partially manipulated AI-generated videos across modalities.

The paper tackles AI-generated video detection and localization, proposing a multi-modal architecture that integrates LMM semantic, spatio-temporal visual, and multi-scale partial-spoof audio branches. It outperforms existing state-of-the-art methods in both detection and fine-grained temporal localization.

Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips that are only partially manipulated in the visual or audio channel, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors model only a single modality or a subset of modalities, and they lack fine-grained temporal forgery localization. To address these challenges, we introduce a core architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated videos. Extensive experiments show that the approach outperforms existing state-of-the-art methods.
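
To make the distinction between video-level detection and temporal localization concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not say how the three branches are fused or how segments are extracted. The sketch assumes, purely for illustration, that each branch emits one forgery score per frame in [0, 1], that the scores are combined by a weighted average, and that localization is a simple threshold over the fused scores. The names `fuse_scores`, `localize`, and `detect` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_scores(branch_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted average of per-frame scores across branches (assumed fusion rule)."""
    if len(branch_scores) != len(weights):
        raise ValueError("need exactly one weight per branch")
    n_frames = len(branch_scores[0])
    if any(len(scores) != n_frames for scores in branch_scores):
        raise ValueError("all branches must cover the same number of frames")
    total = sum(weights)
    return [
        sum(w * scores[t] for w, scores in zip(weights, branch_scores)) / total
        for t in range(n_frames)
    ]


def localize(scores: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame intervals whose score is >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t  # a forged segment begins
        elif score < threshold and start is not None:
            segments.append((start, t))  # the segment ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # segment runs to the end of the clip
    return segments


def detect(scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Video-level decision: flag the clip if any frame crosses the threshold."""
    return any(score >= threshold for score in scores)


# Made-up per-frame scores for a six-frame clip (illustrative values only).
semantic = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1]
visual = [0.0, 0.1, 0.8, 0.9, 0.2, 0.7]
audio = [0.2, 0.0, 0.7, 0.7, 0.0, 0.9]

fused = fuse_scores([semantic, visual, audio], weights=[1.0, 1.0, 1.0])
segments = localize(fused)
is_forged = detect(fused)
```

In this toy setup a partially manipulated clip is flagged as a whole while `localize` reports only the frame ranges that look forged, which is the kind of output a detector without temporal localization cannot provide.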
