CV AI, Nov 16, 2025

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

arXiv:2511.12676v1 3.6

Originality: Incremental advance

AI Analysis

This work targets practical embodied AI for infrastructure inspection. The advance appears incremental: it builds on existing EQA frameworks and contributes a new domain-specific benchmark.

The paper tackles the challenge of deploying embodied agents for real-world question answering. It proposes BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs grounded in professional bridge inspection reports across 200 real-world scenes. It also introduces EMVR, a method that formulates inspection as sequential navigation over an image-based scene graph. EMVR shows strong performance over baselines, while evaluations of state-of-the-art vision-language models reveal substantial performance gaps.

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric, Image Citation Relevance, to evaluate a model's ability to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.
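The idea of treating inspection as sequential navigation over an image-based scene graph can be illustrated with a toy sketch. This is not the authors' EMVR implementation: the names `SceneGraph`, `State`, `greedy_policy`, and `run_episode`, the hand-written `relevance` scores (a stand-in for a vision-language model's judgment), the greedy rule, and the stopping threshold are all assumptions made for illustration. It only shows the general shape of the formulation: images are nodes, the agent's state is its current view plus the views already seen, and each step is either a move to a neighboring view or a decision to answer.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Hypothetical structure: image id -> ids of neighboring views.
    neighbors: dict[str, list[str]]


@dataclass
class State:
    current: str                                       # image the agent is looking at
    visited: list[str] = field(default_factory=list)   # images seen so far, in order


def legal_actions(graph, state):
    """Moves to unvisited neighbors, plus the terminal 'answer' action."""
    moves = [("move", n) for n in graph.neighbors[state.current]
             if n not in state.visited]
    return moves + [("answer", None)]


def step(state, action):
    """Deterministic transition: returns (next_state, done)."""
    kind, target = action
    if kind == "move":
        return State(current=target, visited=state.visited + [target]), False
    return state, True


def greedy_policy(graph, state, relevance, threshold):
    """Placeholder policy (assumption): answer once the current view looks
    relevant enough, otherwise move to the most relevant unvisited neighbor."""
    moves = [a for a in legal_actions(graph, state) if a[0] == "move"]
    if relevance[state.current] >= threshold or not moves:
        return ("answer", None)
    return max(moves, key=lambda a: relevance[a[1]])


def run_episode(graph, start, relevance, threshold=0.8, max_steps=10):
    """Roll out the policy and return the visited images as candidate citations."""
    state = State(current=start, visited=[start])
    for _ in range(max_steps):
        action = greedy_policy(graph, state, relevance, threshold)
        state, done = step(state, action)
        if done:
            break
    return state.visited


# Toy scene with made-up image ids and relevance scores.
graph = SceneGraph(neighbors={
    "deck_01": ["deck_02", "pier_01"],
    "deck_02": ["deck_01"],
    "pier_01": ["deck_01", "bearing_01"],
    "bearing_01": ["pier_01"],
})
relevance = {"deck_01": 0.1, "deck_02": 0.2, "pier_01": 0.5, "bearing_01": 0.9}

print(run_episode(graph, "deck_01", relevance))
# ['deck_01', 'pier_01', 'bearing_01']
```

In a real system the relevance lookup would presumably be replaced by a model that inspects each image against the question, and the returned path could serve as the set of cited images.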
