Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality
This work addresses security and reliability risks for augmented reality users by benchmarking vision-language models against contradictory virtual content attacks. The contribution is incremental: it focuses on evaluation rather than proposing new methods.
The authors study contradictory virtual content attacks in augmented reality. They build the ContrAR benchmark from 312 real-world AR videos and evaluate 11 vision-language models on it. The models show a reasonable understanding of contradictory virtual content, but their detection of and reasoning about adversarial manipulations leave room for improvement, and balancing detection accuracy against latency remains difficult.
Augmented reality (AR) has expanded rapidly over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, in which malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit a reasonable understanding of contradictory virtual content, there is still room for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy against latency remains challenging.
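
The accuracy/latency trade-off mentioned above can be made concrete with a small sketch. The abstract does not define the benchmark's metrics, so this code assumes a simple binary setup in which each model returns an attack/no-attack verdict per video, and it reports per-model accuracy alongside mean latency. The `Prediction` record, the `summarize` helper, the model names, and every number are hypothetical and for illustration only; none of them come from ContrAR.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    """One model verdict on one AR video (all fields are hypothetical)."""
    model: str
    is_attack: bool         # assumed ground-truth label for the video
    predicted_attack: bool  # the model's verdict
    latency_s: float        # wall-clock time to produce the verdict, in seconds


def summarize(predictions):
    """Return {model: (accuracy, mean_latency_s)} for a list of Prediction records."""
    by_model = {}
    for p in predictions:
        by_model.setdefault(p.model, []).append(p)

    summary = {}
    for model, rows in by_model.items():
        # Accuracy: fraction of videos where the verdict matches the label.
        accuracy = mean(1.0 if r.is_attack == r.predicted_attack else 0.0 for r in rows)
        latency = mean(r.latency_s for r in rows)
        summary[model] = (accuracy, latency)
    return summary


# Made-up records for illustration only; these are not results from the paper.
records = [
    Prediction("model_a", True, True, 4.0),
    Prediction("model_a", False, False, 6.0),
    Prediction("model_b", True, False, 1.0),
    Prediction("model_b", False, False, 1.0),
]

for model, (acc, lat) in summarize(records).items():
    print(f"{model}: accuracy={acc:.2f}, mean latency={lat:.1f}s")
```

Reporting the two numbers side by side, rather than folding them into one score, keeps the trade-off visible: in this toy data the slower hypothetical model is the more accurate one.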