CVDec 30, 2025

T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

Changzhen Li, Yuecong Min, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

arXiv:2512.23953v1h-index: 10

Originality Incremental advance

AI Analysis

This work addresses security risks in T2V models for AI safety and video generation applications, but it is incremental as it applies known adversarial attack concepts to a new domain.

The paper tackles the vulnerability of Text-to-Video (T2V) diffusion models to adversarial attacks by introducing T2VAttack, which evaluates attacks from semantic and temporal perspectives, and finds that minor prompt modifications can cause substantial degradation in semantic fidelity and temporal dynamics.

The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.

View on arXiv PDF

Similar