CV · AI · Apr 17

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar

arXiv:2604.10916 · 78.8 · h-index: 39

Predicted impact: top 30% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For researchers developing autonomous ultrasound systems, this benchmark fills a gap by evaluating dynamic procedural understanding rather than static images.

The paper introduces ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions for procedure-centric ultrasound understanding. Zero-shot evaluation shows that VLMs can extract some procedural information but gain little over text-only baselines on troubleshooting questions, which exposes limitations in causal reasoning.

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
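The comparison against text-only baselines described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical record layout for the multiple-choice questions (competency, gold choice, answer with video, answer without video); the function names and the toy answers are invented for illustration and are not taken from the paper or its data.

```python
from collections import defaultdict

# Hypothetical record layout (an assumption, not the benchmark's real schema):
# (competency, gold_choice, answer_with_video, answer_text_only)


def accuracy_by_competency(records, answer_index):
    """Return {competency: fraction correct} for the answers at answer_index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        competency, gold = rec[0], rec[1]
        total[competency] += 1
        if rec[answer_index] == gold:
            correct[competency] += 1
    return {c: correct[c] / total[c] for c in total}


def gain_over_text_only(records):
    """Accuracy with video minus text-only accuracy, per competency."""
    with_video = accuracy_by_competency(records, 2)
    text_only = accuracy_by_competency(records, 3)
    return {c: with_video[c] - text_only[c] for c in with_video}


# Made-up toy answers, for illustration only.
toy = [
    ("Action-Goal Reasoning", "A", "A", "B"),
    ("Action-Goal Reasoning", "C", "C", "C"),
    ("Artifact Resolution & Optimization", "B", "D", "D"),
    ("Artifact Resolution & Optimization", "A", "A", "A"),
]
gains = gain_over_text_only(toy)
```

In this toy data the video input helps on one competency and adds nothing on the other; a gain near zero is the pattern the abstract reports for troubleshooting questions. Free-response questions would need a separate grading scheme, which this sketch does not attempt.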
