SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
This work addresses the lack of surgical reasoning capabilities in current surgical AI by providing a new dataset and models, with the aim of making surgical assistance systems more capable and safer.
The authors introduce SUREON, a large-scale video QA dataset with 206.8K QA pairs across 134.7K clips and 170 procedure types, designed to capture surgical reasoning from expert narrations in academic videos. They also develop SureonVLM and SureonVLM-R1, which exceed 84% accuracy on the SUREON benchmark, outperform larger general-domain models, and, in the case of SureonVLM-R1, exhibit explicit reasoning behavior.
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8K QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery. They exceed 84% accuracy on the SUREON benchmark, substantially outperforming larger general-domain models, and they also outperform general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
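The abstract names Group Relative Policy Optimization but does not describe the training setup. As a rough illustration of the group-relative idea only, the sketch below normalizes the rewards of several sampled answers to the same question against their own group mean and standard deviation. The binary correctness reward, the group size, the function name `group_relative_advantages`, and the use of the population standard deviation are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers sampled for the same question.

    Each answer's advantage is its reward minus the group mean, divided by the
    group standard deviation. eps guards against division by zero when every
    answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one question,
# scored 1.0 if correct and 0.0 otherwise (assumed reward scheme).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this scheme, answers that beat their group average receive positive advantages and the rest receive negative ones, so no separately learned value function is needed as a baseline. A group in which every answer earns the same reward yields zero advantages and therefore no learning signal for that question.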