CV · CR · May 30, 2025

Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models

Ying Yang, Jie Zhang, Xiao Lv, Di Lin, Tao Xiang, Qing Guo

arXiv:2505.24227v1 · 6.21 citations · h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the adversarial robustness of vision-language models. It offers a new attack method whose main contribution is an incremental improvement in the naturalness of adversarial samples over prior approaches.

The paper tackles the challenge of generating natural adversarial samples for vision-language pre-training models. It proposes LightD, a framework that uses ChatGPT and a relighting model to create semantically guided lighting adjustments, and reports effective attacks on tasks such as image captioning and visual question answering across various models.

While adversarial attacks on vision-and-language pretraining (VLP) models have been explored, generating natural adversarial samples crafted through realistic and semantically meaningful perturbations remains an open challenge. Existing methods, primarily designed for classification tasks, struggle when adapted to VLP models due to their restricted optimization spaces, leading to ineffective attacks or unnatural artifacts. To address this, we propose LightD, a novel framework that generates natural adversarial samples for VLP models via semantically guided relighting. Specifically, LightD leverages ChatGPT to propose context-aware initial lighting parameters and integrates a pretrained relighting model (IC-Light) to enable diverse lighting adjustments. LightD expands the optimization space while ensuring perturbations align with scene semantics. Additionally, gradient-based optimization is applied to the reference lighting image to further enhance attack effectiveness while maintaining visual naturalness. The effectiveness and superiority of the proposed LightD have been demonstrated across various VLP models in tasks such as image captioning and visual question answering.
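The gradient-based step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `initial_light` is a hypothetical stand-in for the ChatGPT-proposed lighting parameters, `relight` is a per-pixel gain standing in for IC-Light, and `score` is a linear function standing in for a VLP model's confidence in the correct output. The value bounds `lo` and `hi` are an assumed way to keep the lighting plausible.

```python
import numpy as np

def initial_light(shape, low=0.8, high=1.2):
    # Hypothetical stand-in for LLM-proposed lighting: a left-to-right ramp.
    ramp = np.linspace(low, high, shape[1])
    return np.tile(ramp, (shape[0], 1))

def relight(image, light):
    # Toy stand-in for a relighting model: per-pixel multiplicative gain.
    return np.clip(image * light, 0.0, 1.0)

def score(relit, w):
    # Toy stand-in for the victim model's score on the correct output.
    return float(np.sum(relit * w))

def optimize_lighting(image, w, init_light, steps=50, lr=0.02, lo=0.5, hi=1.5):
    # Signed-gradient descent on the lighting map to lower the model score.
    light = init_light.copy()
    for _ in range(steps):
        pre = image * light
        unclipped = (pre > 0.0) & (pre < 1.0)   # clip() has zero gradient elsewhere
        grad = w * image * unclipped            # analytic d(score)/d(light)
        light = light - lr * np.sign(grad)      # step against the score
        light = np.clip(light, lo, hi)          # keep lighting in a plausible range
    return light
```

In this sketch the optimization variable is the lighting map rather than the image pixels, which mirrors the abstract's point that perturbations are confined to lighting. A real attack would backpropagate through the relighting model and the target VLP model instead of using an analytic gradient.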
