CV · AI · Apr 17

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

arXiv:2604.10916 · 78.8 · h-index: 39
Predicted impact: top 30% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For researchers developing autonomous ultrasound systems, this benchmark fills a gap by evaluating dynamic procedural understanding rather than static images.

The paper introduces ReXSonoVQA, a video QA benchmark of 514 video clips and 514 questions for procedure-centric ultrasound understanding. Zero-shot evaluation shows that VLMs can extract some procedural information but struggle with troubleshooting questions, which exposes limitations in causal reasoning.

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
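
The comparison against text-only baselines described above can be illustrated with a minimal sketch. It assumes each multiple-choice item records a gold answer plus one prediction made with the video and one made from the question text alone. The `MCQItem` record, its field names, and the toy items are hypothetical and are not taken from the paper or its data.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    """Hypothetical record for one multiple-choice question (not the paper's schema)."""
    competency: str   # one of the competencies named in the abstract
    answer: str       # gold option letter
    video_pred: str   # model prediction when the clip is provided
    text_pred: str    # model prediction from the question text alone


def accuracy(items, field):
    """Fraction of items whose prediction in `field` matches the gold answer."""
    if not items:
        return 0.0
    return sum(getattr(it, field) == it.answer for it in items) / len(items)


def gain_by_competency(items):
    """Video accuracy minus text-only accuracy, grouped by competency."""
    groups = {}
    for it in items:
        groups.setdefault(it.competency, []).append(it)
    return {
        name: accuracy(group, "video_pred") - accuracy(group, "text_pred")
        for name, group in groups.items()
    }


# Toy items for illustration only; these are not benchmark results.
items = [
    MCQItem("Action-Goal Reasoning", "A", "A", "B"),
    MCQItem("Action-Goal Reasoning", "C", "C", "C"),
    MCQItem("Artifact Resolution & Optimization", "B", "B", "B"),
    MCQItem("Artifact Resolution & Optimization", "D", "A", "A"),
]

gains = gain_by_competency(items)
for name, gain in gains.items():
    print(f"{name}: gain over text-only = {gain:+.2f}")
```

A gain near zero for a competency means the model answers about as well without the video, which is the pattern the abstract reports for troubleshooting questions.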

