CV | AI | Nov 16, 2025

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

arXiv:2511.12676v1
Originality: Incremental advance
AI Analysis

This paper addresses practical embodied AI for infrastructure inspection professionals. It appears incremental, since it builds on existing EQA frameworks and adds a new domain-specific benchmark.

The paper tackles the challenge of deploying embodied agents for real-world question answering. It proposes BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs grounded in professional bridge inspection reports across 200 real-world scenes. It also introduces EMVR, a method that formulates inspection as sequential navigation over an image-based scene graph. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps on the benchmark, and EMVR shows strong performance over the baselines.

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks that faithfully capture practical operating conditions. We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. Questions require synthesizing visual evidence across multiple images and aligning responses with NBI condition ratings. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps under episodic memory EQA settings. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection as sequential navigation over an image-based scene graph: images are nodes, and an agent takes actions to traverse views, compare evidence, and reason within a Markov decision process. EMVR shows strong performance over the baselines. We publicly release both the dataset and code.
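To make the "sequential navigation over an image-based scene graph" idea concrete, here is a minimal Python sketch. It is not the authors' implementation. The names `SceneGraph`, `State`, `step`, `run_episode`, and `toy_policy`, the two action types (`move` and `answer`), and the image ids are all assumptions made for illustration. In a real system the policy would presumably be a vision-language model reasoning over image content, whereas the toy policy here only visits unseen neighbouring images and then stops.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# An action is (kind, argument): ("move", image_id) or ("answer", text).
# This action set is a hypothetical simplification, not taken from the paper.
Action = Tuple[str, str]


@dataclass
class SceneGraph:
    # Maps each image id (node) to the image ids reachable from it (edges).
    edges: Dict[str, List[str]]


@dataclass
class State:
    current: str                                       # image currently viewed
    visited: List[str] = field(default_factory=list)   # images seen so far


def step(graph: SceneGraph, state: State, action: Action) -> Tuple[State, bool]:
    """Apply one action and return (next_state, done)."""
    kind, arg = action
    if kind == "move":
        if arg not in graph.edges[state.current]:
            raise ValueError(f"{arg} is not adjacent to {state.current}")
        return State(arg, state.visited + [arg]), False
    if kind == "answer":
        return state, True  # answering ends the episode
    raise ValueError(f"unknown action kind: {kind}")


def toy_policy(graph: SceneGraph, state: State) -> Action:
    """Placeholder policy: move to an unvisited neighbour, else answer."""
    for neighbour in graph.edges[state.current]:
        if neighbour not in state.visited:
            return ("move", neighbour)
    return ("answer", "placeholder answer")


def run_episode(
    graph: SceneGraph,
    start: str,
    policy: Callable[[SceneGraph, State], Action],
    max_steps: int = 10,
) -> Tuple[Optional[str], List[str]]:
    """Roll out the policy; return (answer or None, images visited in order)."""
    state = State(start, [start])
    for _ in range(max_steps):
        action = policy(graph, state)
        state, done = step(graph, state, action)
        if done:
            return action[1], state.visited
    return None, state.visited  # step budget exhausted without an answer


graph = SceneGraph({
    "img_a": ["img_b"],
    "img_b": ["img_a", "img_c"],
    "img_c": ["img_b"],
})
answer, visited = run_episode(graph, "img_a", toy_policy)
```

The visited list doubles as a record of which images the agent inspected before answering, which is the kind of trace an image-citation metric could be scored against. How the paper actually computes Image Citation Relevance is not stated in the abstract.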
