CV, AI
May 21, 2025

AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection

Zhipei Xu, Xuanyu Zhang, Qing Huang, Xing Zhou, Jian Zhang

arXiv:2505.15173v3, 3 citations, h-index: 12

Originality: Highly original

AI Analysis

This work addresses the growing threat that realistic full-body synthetic videos pose to information authenticity, offering a detection method that avoids annotation bias and improves generalization.

The paper tackles the detection of synthetic human-centric videos by proposing AvatarShield, a multimodal detection framework that uses Group Relative Policy Optimization to train LLMs from binary labels instead of dense textual supervision. It introduces a new benchmark of 15K videos and reports better performance than existing methods in both in-domain and cross-domain settings.

Recent advances in Artificial Intelligence Generated Content have led to highly realistic synthetic videos, particularly in human-centric scenarios involving speech, gestures, and full-body motion, posing serious threats to information authenticity and public trust. Unlike DeepFake techniques that focus on localized facial manipulation, human-centric video generation methods can synthesize entire human bodies with controllable movements, enabling complex interactions with environments, objects, and even other people. However, existing detection methods largely overlook the growing risks posed by such full-body synthetic content. Meanwhile, a growing body of research has explored leveraging LLMs for interpretable fake detection, aiming to explain decisions in natural language. Yet these approaches heavily depend on supervised fine-tuning, which introduces limitations such as annotation bias, hallucinated supervision, and weakened generalization. To address these challenges, we propose AvatarShield, a novel multimodal human-centric synthetic video detection framework that eliminates the need for dense textual supervision by adopting Group Relative Policy Optimization, enabling LLMs to develop reasoning capabilities from simple binary labels. Our architecture combines a discrete vision tower for high-level semantic inconsistencies and a residual extractor for fine-grained artifact analysis. We further introduce FakeHumanVid, a large-scale benchmark containing 15K real and synthetic videos across nine state-of-the-art human generation methods driven by text, pose, or audio. Extensive experiments demonstrate that AvatarShield outperforms existing methods in both in-domain and cross-domain settings.
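To illustrate how simple binary labels can drive Group Relative Policy Optimization, the sketch below computes group-relative advantages for a set of sampled responses to one video. It is a minimal illustration of the general GRPO idea, not the authors' implementation: the function names `binary_reward` and `group_relative_advantages`, the 0/1 reward, the population standard deviation, and the `eps` constant are all assumptions made for this example.

```python
from statistics import mean, pstdev


def binary_reward(predicted_label: str, true_label: str) -> float:
    """Hypothetical reward: 1.0 if the sampled response's verdict matches the label, else 0.0."""
    return 1.0 if predicted_label == true_label else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own sampling group.

    Responses that beat the group average get a positive advantage and are
    reinforced; those below it get a negative advantage. No learned value
    model or textual annotation is needed, only the scalar rewards.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group: four sampled verdicts for a single video whose label is "fake".
predictions = ["fake", "real", "fake", "fake"]
rewards = [binary_reward(p, "fake") for p in predictions]
advantages = group_relative_advantages(rewards)

for pred, adv in zip(predictions, advantages):
    print(f"{pred}: advantage {adv:+.3f}")
```

In this sketch the three correct verdicts receive a small positive advantage and the single incorrect one a larger negative advantage. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.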

