Maria Gorlatova

CV
h-index16
10papers
111citations
Novelty39%
AI Score51

10 Papers

68.5CVApr 7Code
Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Yanming Xiu, Zhengayuan Jiang, Neil Zhenqiang Gong et al.

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

CVJan 30
User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments

Junfeng Lin, Yanming Xiu, Maria Gorlatova

Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous, and examine the impact of two enhancement strategies on these prompts. Results show that both models exhibit stable performance under underdetailed and standard prompts, while they suffer degradation under ambiguous prompts. Overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% mIoU and 41% average confidence. Based on the findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.

CVJan 22, 2025Code
ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality

Yanming Xiu, Tim Scargill, Maria Gorlatova

In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users' ability to accurately interpret real-world information. In this paper we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users' ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.

HCSep 29, 2021Code
Here To Stay: Measuring Hologram Stability in Markerless Smartphone Augmented Reality

Tim Scargill, Jiasi Chen, Maria Gorlatova

Markerless augmented reality (AR) has the potential to provide engaging experiences and improve outcomes across a wide variety of industries; the overlaying of virtual content, or holograms, onto a view of the real world without the need for predefined markers provides great convenience and flexibility. However, unwanted hologram movement frequently occurs in markerless smartphone AR due to challenging visual conditions or device movement, and resulting error in device pose tracking. We develop a method for measuring hologram positional errors on commercial smartphone markerless AR platforms, implement it as an open-source AR app, HoloMeasure, and use the app to conduct systematic quantitative characterizations of hologram stability across 6 different user actions, 3 different smartphone models, and over 200 different environments. Our study demonstrates significant levels of spatial instability in holograms in all but the simplest settings, and underscores the need for further enhancements to pose tracking algorithms for smartphone-based markerless AR.

CVJul 27, 2025
Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach

Yanming Xiu, Maria Gorlatova

The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR, where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM, which consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect the attacks in the dataset, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system achieves an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.

CVJan 21, 2025
Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

Lin Duan, Yanming Xiu, Maria Gorlatova

Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.

CVAug 7, 2025
A Neurosymbolic Framework for Interpretable Cognitive Attack Detection in Augmented Reality

Rongqian Chen, Allison Andreyev, Yanming Xiu et al.

Augmented Reality (AR) enriches perception by overlaying virtual elements on the physical world. Due to its growing popularity, cognitive attacks that alter AR content to manipulate users' semantic perception have received increasing attention. Existing detection methods often focus on visual changes, which are restricted to pixel- or image-level processing and lack semantic reasoning capabilities, or they rely on pre-trained vision-language models (VLMs), which function as black-box approaches with limited interpretability. In this paper, we present CADAR, a novel neurosymbolic approach for cognitive attack detection in AR. It fuses multimodal vision-language inputs using neural VLMs to obtain a symbolic perception-graph representation, incorporating prior knowledge, salience weighting, and temporal correlations. The model then enables particle-filter based statistical reasoning -- a sequential Monte Carlo method -- to detect cognitive attacks. Thus, CADAR inherits the adaptability of pre-trained VLM and the interpretability and reasoning rigor of particle filtering. Experiments on an extended AR cognitive attack dataset show accuracy improvements of up to 10.7% over strong baselines on challenging AR attack scenarios, underscoring the promise of neurosymbolic methods for effective and interpretable cognitive attack detection.

LGAug 7, 2025
Will You Be Aware? Eye Tracking-Based Modeling of Situational Awareness in Augmented Reality

Zhehan Qu, Tianyi Hu, Christian Fronk et al.

Augmented Reality (AR) systems, while enhancing task performance through real-time guidance, pose risks of inducing cognitive tunneling-a hyperfocus on virtual content that compromises situational awareness (SA) in safety-critical scenarios. This paper investigates SA in AR-guided cardiopulmonary resuscitation (CPR), where responders must balance effective compressions with vigilance to unpredictable hazards (e.g., patient vomiting). We developed an AR app on a Magic Leap 2 that overlays real-time CPR feedback (compression depth and rate) and conducted a user study with simulated unexpected incidents (e.g., bleeding) to evaluate SA, in which SA metrics were collected via observation and questionnaires administered during freeze-probe events. Eye tracking analysis revealed that higher SA levels were associated with greater saccadic amplitude and velocity, and with reduced proportion and frequency of fixations on virtual content. To predict SA, we propose FixGraphPool, a graph neural network that structures gaze events (fixations, saccades) into spatiotemporal graphs, effectively capturing dynamic attentional patterns. Our model achieved 83.0% accuracy (F1=81.0%), outperforming feature-based machine learning and state-of-the-art time-series models by leveraging domain knowledge and spatial-temporal information encoded in ET data. These findings demonstrate the potential of eye tracking for SA modeling in AR and highlight its utility in designing AR systems that ensure user safety and situational awareness.

CRMay 17, 2023
PrivaScissors: Enhance the Privacy of Collaborative Inference through the Lens of Mutual Information

Lin Duan, Jingwei Sun, Yiran Chen et al.

Edge-cloud collaborative inference empowers resource-limited IoT devices to support deep learning applications without disclosing their raw data to the cloud server, thus preserving privacy. Nevertheless, prior research has shown that collaborative inference still results in the exposure of data and predictions from edge devices. To enhance the privacy of collaborative inference, we introduce a defense strategy called PrivaScissors, which is designed to reduce the mutual information between a model's intermediate outcomes and the device's data and predictions. We evaluate PrivaScissors's performance on several datasets in the context of diverse attacks and offer a theoretical robustness guarantee.

LGJun 29, 2021
UAV-assisted Online Machine Learning over Multi-Tiered Networks: A Hierarchical Nested Personalized Federated Learning Approach

Su Wang, Seyyedali Hosseinalipour, Maria Gorlatova et al.

We investigate training machine learning (ML) models across a set of geo-distributed, resource-constrained clusters of devices through unmanned aerial vehicles (UAV) swarms. The presence of time-varying data heterogeneity and computational resource inadequacy among device clusters motivate four key parts of our methodology: (i) stratified UAV swarms of leader, worker, and coordinator UAVs, (ii) hierarchical nested personalized federated learning (HN-PFL), a distributed ML framework for personalized model training across the worker-leader-core network hierarchy, (iii) cooperative UAV resource pooling to address computational inadequacy of devices by conducting model training among the UAV swarms, and (iv) model/concept drift to model time-varying data distributions. In doing so, we consider both micro (i.e., UAV-level) and macro (i.e., swarm-level) system design. At the micro-level, we propose network-aware HN-PFL, where we distributively orchestrate UAVs inside swarms to optimize energy consumption and ML model performance with performance guarantees. At the macro-level, we focus on swarm trajectory and learning duration design, which we formulate as a sequential decision making problem tackled via deep reinforcement learning. Our simulations demonstrate the improvements achieved by our methodology in terms of ML performance, network resource savings, and swarm trajectory efficiency.